# Estimation of Model in Chapter 2 of Thesis

In [2]:
import pandas as pd
import numpy as np
from scipy.optimize import minimize as MIN, show_options as SO
import time
from collections import Counter
%load_ext Cython

### File Names for the dataframe

In [3]:
#File path of STATA dataset

Path_Data = "/Users/idiosyncrasy58/Dropbox/Documents/College/"+ \
            "Universitat Autonoma de Barcelona/IDEA - Economics/"+ \
            "Doctoral Thesis Ideas/Migration/IFLS/Project Files/"+ \
            "Temp Files/Longitudinal Adult Children Data for Estimation.dta"

In [4]:
index = {1:'Low-Skilled, Everywhere Else',2:'Low-Skilled, Java',
         3:'High-Skilled, Everywhere Else',4:'High-Skilled, Java'}
 
col_keep = ['pidlink','sex','age','MaxSchYrs','ParentalSchAvg','MarketCode','InterMarket_FamilyMig',
            'Skill_Level_2','Skill_Level_2_Parents','Wage_2_HH','Educ_3']

#Read in the file
Data = pd.read_stata(Path_Data,columns=col_keep).rename(columns={'pidlink':'Household'})

Data.head()

Unnamed: 0,Household,sex,age,MaxSchYrs,ParentalSchAvg,MarketCode,InterMarket_FamilyMig,Skill_Level_2,Skill_Level_2_Parents,Wage_2_HH,Educ_3
0,1220003,1.0,19,13,3.0,1,0,1,0,1.047,1
1,1250003,1.0,14,3,4.0,1,0,0,0,1.047,0
2,1290003,3.0,14,4,0.0,1,0,0,0,1.047,0
3,2010007,3.0,17,13,7.0,1,0,1,0,1.047,1
4,2040003,3.0,14,6,0.5,1,0,0,0,1.047,0


In [5]:
#Create the Variables highlighting where the parents moved to

def Market_Move(Curr_Loc, Choice):
    
    if Curr_Loc==1 and Choice==1:
        Move = 2
    elif Curr_Loc==2 and Choice==1:
        Move = 1
    elif Choice==0:
        Move = Curr_Loc
    
    return Move
    
def State_Def(Loc, Skill):
    
    if Loc==1 and Skill==0:
        State = 1
    elif Loc==1 and Skill==1:
        State = 3
    elif Loc==2 and Skill==0:
        State = 2
    else: State = 4
        
    return State

def Dec_Def(Loc_Choice, Educ_Choice):
    
    if Loc_Choice==0 and Educ_Choice==0:
        Decision = 1
    elif Loc_Choice==0 and Educ_Choice==1:
        Decision = 2
    elif Loc_Choice==1 and Educ_Choice==0:
        Decision = 3
    else: Decision = 4
        
    return Decision

Data['MarketCode_Move'] = Data.apply(lambda row: 
                                     Market_Move(row['MarketCode'],row['InterMarket_FamilyMig']), 
                                     axis=1)

Data['Parent_State'] = Data.apply(lambda row: 
                                     State_Def(row['MarketCode'],row['Skill_Level_2_Parents']), 
                                     axis=1)

Data['Decision'] = Data.apply(lambda row: 
                                     Dec_Def(row['InterMarket_FamilyMig'],row['Educ_3']), 
                                     axis=1)

### Test the States for the Transition Functions

For the simple model we will consider the possibility that children are educated but since the 'amount' of education parents invested in their child may not reach compulsory education (up to grade 9 or more), then those who do not reach this threshold are considered 'low skilled' --> so that parents' education decision may not lead to a high skilled outcome. 

That being said, if parents give their child less than 9 years of education, then they should necessarily know that these children are not going to be high skilled. In which case, this is not ex ante expected but a deterministic outcome. If, on the other hand, I use the child's occupational choice to determine the skill of the child (through the ONET dataset), then parents can form an expectation on the child's skill level outcome from education, since some children with lower education may end up in a high skilled job, some children with more education may end up in a low skilled job. 

**Solution**: Go with the deterministic to make it simple

In [6]:
Data.loc[Data.InterMarket_FamilyMig==0].groupby(['Skill_Level_2_Parents','MarketCode','MarketCode_Move','Educ_3'])['Skill_Level_2'].mean()

Skill_Level_2_Parents  MarketCode  MarketCode_Move  Educ_3
0                      1           1                0         0
                                                    1         1
                       2           2                0         0
                                                    1         1
1                      1           1                0         0
                                                    1         1
                       2           2                0         0
                                                    1         1
Name: Skill_Level_2, dtype: int8

In [7]:
Data.loc[Data.InterMarket_FamilyMig==1].groupby(['Skill_Level_2_Parents','MarketCode','MarketCode_Move','Educ_3'])['Skill_Level_2'].mean()

Skill_Level_2_Parents  MarketCode  MarketCode_Move  Educ_3
0                      1           2                0         0
                                                    1         1
                       2           1                0         0
                                                    1         1
1                      1           2                0         0
                                                    1         1
                       2           1                0         0
                                                    1         1
Name: Skill_Level_2, dtype: int8

### Econometric Model Code

The following cells generate the estimation of the model. Likely, where possible, code will be written in Cython when necessary (for example: the inner loop and the maximization via GSL).

##### Model
I write the following value function from the point of view of the old-age agent, indexed by $(d,g,t=2)$ (dynasty $d$, generation $g$, and period of life $t$):
\begin{equation}
V_{d,g,t=2}(z,\varepsilon) = \max_{I_k\in I} \text{  } \sum_k I_{k}\left\{v_{d,g,t=2}(z,k) + \varepsilon_{k} \right\} \end{equation}
where
\begin{equation}
v_{d,g,t=2}(z,k)=u(c)+\alpha \text{E}\left[V_{d,g',t=2}(z',\varepsilon')\big|z,I_{k}=1\right]
\end{equation}

and utility, being linear and the same across generations, is $u(c)=c$.

##### Specification
The econometric specification of the model is the following:

Wages: Agents receieve the median wages offered in each market based on their skill-level (where I take the skill-level of the first generation as pre-determined). Agents choose at the beginning of the period where they want to live and pay the cost to move there. Simultaneously, they choose whether to educate their child or not in their chosen location. 

The budget constraint is given by:
\begin{equation}
c = w^{hh}(h,j) - \delta\cdot 1(\ell\neq j) - \phi_{j}\cdot 1(e=1)
\end{equation}

and 
\begin{equation}
w^{hh}(h,j)=med(w(h,j))
\end{equation}

#### Parameters

In [8]:
#Parameters and parameter vector to pass into function
alpha=0.99**18        #altruism parameter = 0.99^18 
                          #(18 years old when child is supposed to finish schooling)
tot_states=4          #number of states
tot_decisions=4       #number of decisions

#### State Space Variable

In [9]:
#Market Adult Wages Array
#rows:      regions
#columns:   skill levels

wage_R1_ls=1.047  #Everywhere Else
wage_R1_hs=2.513  #Everywhere Else
wage_R2_ls=1      #Island of Java
wage_R2_hs=2.626  #Island of Java

#Strucutre a wage array for quick access 
wage_lst=[[wage_R1_ls]*2+[wage_R2_ls]*2,
          [wage_R2_ls]*2+[wage_R1_ls]*2,
          [wage_R1_hs]*2+[wage_R2_hs]*2,
          [wage_R2_hs]*2+[wage_R1_hs]*2]

wages = np.array(wage_lst, dtype='d').reshape((tot_states,tot_decisions))

#### Transition Function

In [10]:
tran_st=[[1,0,0,0],[0,0,1,0],[0,1,0,0],[0,0,0,1],
         [0,1,0,0],[0,0,0,1],[1,0,0,0],[0,0,1,0],
         [1,0,0,0],[0,0,1,0],[0,1,0,0],[0,0,0,1],
         [0,1,0,0],[0,0,0,1],[1,0,0,0],[0,0,1,0]]

tran_func = np.array(tran_st, dtype='d').reshape((tot_states,tot_decisions,tot_states))

#### Function to permute the Costs for vectorization

In [11]:
def Perm_Param(Param):
    
    #Education costs
    educ_lst=[[0,Param[1],0,Param[2]],
              [0,Param[2],0,Param[1]]]*2

    educ_cost=np.array(educ_lst, dtype='d').reshape((tot_states,tot_decisions))

    #Moving Costs
    move_lst=[[0]*2+[Param[0]]*2]

    move_cost=np.array(move_lst, dtype='d')
    
    cost = educ_cost + move_cost #+ educ_cost*move_cost
    
    return cost

#### Function to get the CCPs from the value functions

In [12]:
def CCP(Param):
    
    #recast the cost paramaters into arrays
    cost = Perm_Param(Param)
    
    #Calculate the value function
    
    #Final Period: Expected lifetime Value based on Child(T=1) = Adult(T=2), Adult(T=1) = Effectively Dead

    Sum = 0

    for t in range(100):
        Sum += (1/(1/alpha)**t) * wages[:,0]
    
    V = np.log(Sum) + np.euler_gamma
    
    #V = np.log(np.exp(wages).sum(axis=1)) + np.euler_gamma
    
    #Initial Period: Adult(T=1) = Alive, Child(T=1)
    
    v = wages + cost + alpha*tran_func.dot(V)
    
    #Calculate the CCPs
    
    CCP = np.exp(v)/(np.exp(v).sum(axis=1).reshape(4,1))
    
    return (CCP,V)

#### Map CCPs to the data based on the decisions taken and the states of the individual

In [13]:
def CCP_Map(State, Decision, CCP):

    Choice_Prob = CCP[State-1,Decision-1]

    return np.log(Choice_Prob)

def CCP_Data(CCP):

    Data['CCP'] = Data.apply(lambda row: 
                             CCP_Map(row['Parent_State'],row['Decision'],CCP), axis=1)

#### Calculation of the Log-Likelihood Function

In [14]:
def LLF(Params, Data):
    
    #Solve the Dynamic Programming Problem
    CCPs, _ = CCP(Params)
    
    #Map the CCPs to the Data
    CCP_Data(CCPs)
    
    #Calculate the log-likelihood value
    LLF = -1*Data.CCP.sum()
    
    return LLF

### Estimation of the Model

#### Unconstrained Maximization of the LLF

Use the minimization routine with BFGS method.

In [15]:
time1 = time.time()

Param_Final = MIN(LLF, np.random.randn(3)*100, method='BFGS', args=(Data,), options={'disp': True, 'gtol':1e-4})

print(str(time.time()-time1)+' seconds')

  """
  grad[k] = (f(*((xk + d,) + args)) - f0) / d[k]
  """
  """
  grad[k] = (f(*((xk + d,) + args)) - f0) / d[k]
  """
  grad[k] = (f(*((xk + d,) + args)) - f0) / d[k]
  """


Optimization terminated successfully.
         Current function value: 2673.486663
         Iterations: 21
         Function evaluations: 284
         Gradient evaluations: 55
25.490896940231323 seconds


#### Results

Optimal Parameters

In [16]:
print(Param_Final.x)

[-4.79304101 -0.60968289 -0.72229798]


Standard Errors

In [17]:
std_err = np.sqrt(np.diag(Param_Final.hess_inv))
print(std_err)

[ 0.18870537  0.05290067  0.0450618 ]


t-statistics

In [18]:
t = Param_Final.x/std_err
print(abs(t))

[ 25.39960053  11.52505068  16.02905352]


#### Compare the Estimated CCPs with the Empirical CCPs

In [19]:
#Group the data according to the states and then collapse the data along the desired dimensions.

function = {'Household':'count', 'InterMarket_FamilyMig':'mean', 'Educ_3':'mean'}

Stats = Data.groupby('Parent_State').agg(function)

Stats['Prop_in_State'] = Stats.Household/len(Data)

Stats.rename(index=index)

Unnamed: 0_level_0,Household,InterMarket_FamilyMig,Educ_3,Prop_in_State
Parent_State,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
"Low-Skilled, Everywhere Else",1245,0.009639,0.481124,0.344684
"Low-Skilled, Java",1546,0.002587,0.459897,0.428018
"High-Skilled, Everywhere Else",322,0.021739,0.742236,0.089147
"High-Skilled, Java",499,0.014028,0.695391,0.138151


Generate the matrix of the Empirical CCPS from the above.

Assumption on mutually independent choices implies that:

\begin{equation}
P(Move=\{0,1\} \cap Educ=\{0,1\}\mid State=S)=P(Move=\{0,1\}\mid State=S)\cdot P(Educ=\{0,1\}\mid Move=\{0,1\} \cap State=S)
\end{equation}

So we can multiply the above probabilities for 'InterMarket_FamilyMig' and 'Educ_3' (as well as their respective compliments) since they are conditional on the State due to the grouping.

In [56]:
#Group the data according to the states and then collapse the data along the desired dimensions.

#Calculate the first conditional probability P(Educ={0,1} | Move={0,1}, State=S)
Educ_DF = Data.groupby(['Parent_State','InterMarket_FamilyMig']).agg({'Educ_3':'mean'})
Educ_DF['No_Educ'] = 1 - Educ_DF.Educ_3
Educ_DF.sort_index(axis=1, ascending=False, inplace=True)

#Calculate the second conditional probability P(Move={0,1}|State=S)
Mig_DF = Data.groupby('Parent_State').agg({'InterMarket_FamilyMig':'mean'})
Mig_DF['No_Mig'] = 1 - Mig_DF.InterMarket_FamilyMig
Mig_DF.sort_index(axis=1, ascending=False, inplace=True)

#Get the underlying numpy array from the DataFrame, reshape to broadcast multiplication
Mig_Prob = Mig_DF.loc[:,['No_Mig','InterMarket_FamilyMig']].get_values().reshape((8,1))

Educ_Prob = Educ_DF.loc[:,['No_Educ','Educ_3']].get_values()

#Create the CCP DataFrame
Columns = ['NoEduc_NoMig','Educ_NoMig','NoEduc_Mig','Educ_Mig']

Emp_CCPs = pd.DataFrame((Educ_Prob*Mig_Prob).reshape((tot_states,tot_decisions)),columns=Columns)

Emp_CCPs

Unnamed: 0,NoEduc_NoMig,Educ_NoMig,NoEduc_Mig,Educ_Mig
0,0.514056,0.476305,0.004819,0.004819
1,0.539457,0.457956,0.000647,0.00194
2,0.254658,0.723602,0.003106,0.018634
3,0.300601,0.685371,0.004008,0.01002


The estimated CCPs can be obtained by plugging back in the estimated parameters of the model into the function that calculates the CCPs from the value functions

In [21]:
Est_CCPs, V = CCP(Param_Final.x)
print(Est_CCPs)

[[ 0.46631429  0.52628036  0.00354838  0.00385698]
 [ 0.47479312  0.51608543  0.0042852   0.00483626]
 [ 0.46571588  0.525605    0.00415872  0.0045204 ]
 [ 0.47543432  0.5167824   0.00365654  0.00412675]]


The absolute differences between the CCPs

## Simulation of the Model

### Cython code for generating an array of states based on the given cumulative distribution

The below code is just a test to create a Cumulative Distribution Function in Cython/C

In [22]:
%%cython

#!python
#cython: boundscheck=False,wraparound=False,nonecheck=False,cdivision=True

#Cython function to generate the next state based on a distribution of  probabilities

from libc.stdlib cimport rand, RAND_MAX

cdef Py_ssize_t tot_states

#A random number generator between 0 and 1
cdef inline double rand_state() nogil:  
    return rand()/<double>RAND_MAX

#This function will determine the index of the array of the cumulative 
#distribution of probabilities 
cdef unsigned int find_interval(double x, double *arr) nogil:
    cdef Py_ssize_t i
    
    for i in range(tot_states):
        if x<arr[i]:
            return i
            
    return 0

#This function will generate the state from the sampling distribution of probabilities
cdef unsigned int next_state(double[:] tran):
    cdef:
        double x
        unsigned int index
        
    x = rand_state()

    index = find_interval(x, &tran[0]) + 1

    return index    

#This function will fill in an array with states based on the cumulative sampling distribution 
#of probabilities
def Sim_States(double[:] Sample_Dist, int[:] output):
    
    global tot_states
    
    tot_states = Sample_Dist.shape[0]
    
    cdef:
        Py_ssize_t i, HH = output.shape[0]
        
    #Fill the array with the states
    for i in range(HH):
        output[i] = next_state(Sample_Dist)        

Create the Iniital distribution of households (10,000 households)

In [23]:
#Initialize the Simulated Household Array
Sim_Data = np.zeros(10000, dtype='int32')

In [24]:
#Pass into the function the values from the Stats dataframe 
#related to the frequencies of states we observe in the data as a numpy array,
#and take the cumulative sum of the array to pass into the Cython function

Sim_States(Stats.Prop_in_State.get_values().cumsum(),Sim_Data)

Check the distribution of the HH simulated states

In [25]:
c = Counter(Sim_Data.tolist())

for key in c:
    c[key] = c[key] / float(len(Sim_Data))

print(c)

Counter({2: 0.4301, 1: 0.3411, 4: 0.1414, 3: 0.0874})


### Cython Code for Simulation of the model based on the parameters

The below code is adapted from the code created in the simulation code folder

In [26]:
%%cython -lgsl -lgslcblas

#!python
#cython: boundscheck=False,wraparound=False,nonecheck=False,cdivision=True

# Cython code to optimise in C the Simulation of the model portion of the code

##################### Import Modules and math functions ######################

#Cython and C functions (this is faster than calling external C function math libs)
cimport cython

from libc.stdlib cimport rand, RAND_MAX, malloc, calloc, free, abort
from libc.math cimport HUGE_VAL

#Use the CythonGSL package to get the random number gen at low-level
from cython_gsl cimport *

####################### Assign the global variables ##########################

#These will be passed into functions automatically without 
#having to call them up explicitely

cdef Py_ssize_t HH, tot_states, tot_decisions, Gen

##############################################################################
####### Define the functions that will assist the simulation module ##########
##############################################################################

############ Random Numbers, Random States, and Random Shocks functions

#Random number generator on interval [0,1]
cdef inline double rand_value() nogil:
    return rand()/<double>RAND_MAX

############# Choice Specific Values assisting functions

#Define the inner-array product, releasing the gil of the function
cdef double dot( double[:] a, double[:] b ) nogil:
    cdef:
        double result=0
        Py_ssize_t i, dim=a.shape[0]

    for i in range(dim):
        result += a[i]*b[i]
    return result

#This function will output the decision based on max value
cdef Py_ssize_t Compare(double* arr, Py_ssize_t curr_hh) nogil:
    
    #declare variable types
    cdef:
        Py_ssize_t dec=0, i
        double v_temp, MAX=(-1)*HUGE_VAL

    #grab the max of the choice specific value for the current household:
    for i in range(1,tot_decisions+1):
        v_temp = arr[(i-1) + curr_hh*tot_decisions]
        if v_temp > MAX: 
            #update the max
            MAX = v_temp
            #capture current index
            dec = i

    return dec


############### Function and auxiliaries determining the next state

#This function rewrites array with the cumulative sum through recursion
cdef void cum_sum(double *arr, size_t index=4-1) nogil:
    if index<=0: return
    cum_sum(arr, index-1)
    arr[index] += arr[index-1]

#This function will determine the index of the transition function 
#based on the cumulative probabilities 
cdef unsigned short find_interval(double x, double *arr) nogil:
    cdef Py_ssize_t i
    
    for i in range(tot_states):
        if x<arr[i]:
            return <unsigned short>i

#This function will generate the next state based on the transition
#function probabilites (a discrete value)
cdef unsigned short Next_State(double[:] tran) nogil:
    cdef:
        double x
        double *array
        unsigned short index
        Py_ssize_t i
    
    array=<double*> calloc(tot_states, sizeof(double))
    
    if array==NULL: abort()

    try:
        #generate a random number to help determine the next state
        x = rand_value()
        
        #copy the transition function values into the array to prevent rewrite
        for i in range(tot_states):
            array[i]=tran[i]
        
        #rewrite the array into the cumulative sum of the elements
        cum_sum(array)
        
        #the next state is the return value of the function
        #(the array index) + 1 to create the next state
        index = find_interval(x, array) + 1
    
        return index

    finally:
        free(array)  


################### Function for filling in the Simulated data array

#This function will calculate the frequency of decisions for each generation
cdef void Data(unsigned short* Data, Py_ssize_t curr_gen, 
               Py_ssize_t curr_hh, Py_ssize_t dec, Py_ssize_t state, 
               unsigned short* next_state_arr) nogil:
    
    #fill in the state
    Data[0 + (curr_hh + curr_gen*HH)*4] = <unsigned short>state
    Data[3 + (curr_hh + curr_gen*HH)*4] = next_state_arr[curr_hh]
    
    #fill in the moving decision
    if dec==3 or dec==4:
        Data[1 + (curr_hh + curr_gen*HH)*4] = 1
    #fill in the education decision
    elif dec==2 or dec==4:
        Data[2 + (curr_hh + curr_gen*HH)*4] = 1


############ Function defining the simulation of the model ################
def Sim_Model(double[:] V, double alpha, double[:,:] wages, 
              double[:,:] cost, double[:,:,:] tranny, 
              unsigned short[:,:,:] Sim_Data, double[:] init_states):
    
    #declare and assign the globals
    global HH, Gen, tot_states, tot_decisions
    
    HH = Sim_Data.shape[1]
    tot_states = V.shape[0]            #Dimension of the states is given by the number of rows in the V array
    tot_decisions = tranny.shape[1]    #Dimension of the decisions is diven by the rows of one of the trans arrays
    Gen = Sim_Data.shape[0]
    
    #declare the types for variables and arrays
    cdef:
        Py_ssize_t decision, state
        
        #define array types
        unsigned short* states
        double* v_sim
        
        #define the shock array
        gsl_rng* r
        
        #define iterators
        Py_ssize_t i, j, k
    
    #allocate arrays
    states = <unsigned short*> calloc(HH, sizeof(unsigned short))
    v_sim = <double*> calloc(HH*tot_states, sizeof(double))
    r = gsl_rng_alloc(gsl_rng_mt19937) #use the MT19937 algorithm for prng
        
    #check that memory was allocated:
    if states==NULL or v_sim==NULL or r==NULL: abort() 
        
    #simulate the model
    try:
        #for initial generation, replace with random states generated from given distribution
        for j in range(HH):
            states[j]=Next_State(init_states) 

        #outerloop are the generations (make sure that we skip the last generation - they
        #make no decisions - so start iterator at 1 and not 0)
        for i in range(Gen):

            #inner loop the households (should be parallelizable)
            for j in range(HH):

                #grab the household's state from the matrix
                state = states[j]

                for k in range(tot_decisions):

                    #calculate choice specific value functions
                    v_sim[k+j*tot_states] = wages[state-1,k] + cost[state-1,k] + \
                                            gsl_ran_gumbel1(r,1,1) + alpha*dot(tranny[state-1,k,:],V)

                #compare values, return the decision (index+1)
                decision = Compare(v_sim,j)

                #rewrite the state array with the next generation's value:
                states[j] = Next_State(tranny[state-1,decision-1,:])
        
                #save the decisions of the houshold
                Data(&Sim_Data[0,0,0],i,j,decision,state,states)

    finally:
        free(v_sim)
        free(states)
        gsl_rng_free(r)

### Python code for the execution of the simulation

Define the inputs of the simulation (use the previously defined functions to generate the cost array and the V array)

In [27]:
#Define the initial states of the model to follow that of :

#Grab the initial distribution of states from the data
init_states = Stats.Prop_in_State.get_values() #np.array([0,0,0,1.0])

#The cost matrix based on the final parameter values
cost = Perm_Param(Param_Final.x)

#Define and initialize the simulated Data array that will be filled in with 
#the results from the simulation

#number of generations
Gen = 2
#number of households
HH = 100000

#Simulated Data to be filled by the simulator
Sim_Data = np.zeros((Gen,HH,4), dtype='uint16')

Run the simulation and time it

In [28]:
time1 = time.time()

Sim_Model(V,alpha,wages,cost,tran_func,Sim_Data,init_states)

print('The model took '+str(time.time()-time1)+' seconds to simulate.')

The model took 0.08794498443603516 seconds to simulate.


## Analyze Results from Simulation

Place the simulated data into a data frame to generate the statistics

In [29]:
Data_Sim = (pd.DataFrame(Sim_Data[0,:,:], columns=['Parent_State','Migrate','Educate','Child_State'])
              .reset_index()
              .rename(columns={'index':'Household'}) )

### Aggregate Statistics

Across whole Simulated dataset

In [30]:
print(Data_Sim.loc[:,['Migrate','Educate']].sum(axis=0)/len(Data_Sim))

Migrate    0.00859
Educate    0.51870
dtype: float64


Across the whole Empirical Dataset

In [31]:
print(Data.loc[:,['InterMarket_FamilyMig','Educ_3']].sum(axis=0)/len(Data))

InterMarket_FamilyMig    0.008306
Educ_3                   0.524917
dtype: float64


We see that unconditional means are well simulated

### By State (Person's Skill and Location)

From Simulation

In [32]:
function = {'Household':'count','Migrate':'mean', 'Educate':'mean'}

Sim_Stats = Data_Sim.groupby('Parent_State',as_index=True).agg(function)
Sim_Stats['Prop_in_State'] = Sim_Stats.Household/len(Data_Sim)

Merge with the original dataset for comparison

In [33]:
(Sim_Stats.loc[:,['Migrate','Educate']].merge(Stats.loc[:,['InterMarket_FamilyMig','Educ_3']],
                                              left_index=True,right_index=True,copy=False)
                                       .rename(index=index))

Unnamed: 0_level_0,Migrate,Educate,InterMarket_FamilyMig,Educ_3
Parent_State,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
"Low-Skilled, Everywhere Else",0.00785,0.524348,0.009639,0.481124
"Low-Skilled, Java",0.009423,0.514508,0.002587,0.459897
"High-Skilled, Everywhere Else",0.00912,0.522411,0.021739,0.742236
"High-Skilled, Java",0.007496,0.515211,0.014028,0.695391


Unfortunately, conditional means on the decisions are note well modeled. Will need to work on this

### Next Generation States

From the simulation

In [34]:
Data_Sim_Child = (pd.DataFrame(Sim_Data[-1,:,0], columns={'Child_State':'0'})
                    .reset_index()
                    .rename(columns={'index':'Household'}) )

Sim_Stats_Child = Data_Sim_Child.groupby('Child_State',as_index=True).agg({'Household':'count'})
Sim_Stats_Child['Prop_in_State_Sim'] = Sim_Stats_Child.Household/len(Data_Sim_Child)

From the data

In [35]:
Data['Child_State'] = Data.apply(lambda row: 
                                     State_Def(row['MarketCode_Move'],row['Skill_Level_2']), 
                                     axis=1)

Stats_Child = Data.groupby('Child_State').agg({'Household':'count'})
Stats_Child['Prop_in_State_Data'] = Stats_Child.Household/len(Data)

Merge restuls

In [36]:
(Sim_Stats_Child.merge(Stats_Child,left_index=True,
                       right_index=True,copy=False)
                .drop(['Household_x','Household_y'], inplace=False, axis=1)
                .rename(index=index))

Unnamed: 0_level_0,Prop_in_State_Sim,Prop_in_State_Data
Child_State,Unnamed: 1_level_1,Unnamed: 2_level_1
"Low-Skilled, Everywhere Else",0.20534,0.20072
"Low-Skilled, Java",0.27135,0.274363
"High-Skilled, Everywhere Else",0.23009,0.230897
"High-Skilled, Java",0.29322,0.29402


The model does capture the conditional distribution of states of the next generation (the children)

### Transition from Parent States to Child States

#### Flattened Transitions

Simulation Results

In [37]:
State_Tran_Sim = ( Data_Sim.groupby(['Parent_State','Child_State'],as_index=True)
                           .agg({'Household':'count'})
                           .rename(index=index) )

State_Tran_Sim['Prop_in_State_Sim'] = State_Tran_Sim.Household/len(Data_Sim)

Data Results

In [38]:
State_Tran_Data = (Data.groupby(['Parent_State','Child_State'],as_index=True)
                       .agg({'Household':'count'})
                       .rename(index=index) )

State_Tran_Data['Prop_in_State_Data'] = State_Tran_Data.Household/len(Data)

Merge the results

In [39]:
(State_Tran_Sim.merge(State_Tran_Data, how='inner', 
                      left_index=True, right_index=True, copy=False)
                .drop(['Household_x','Household_y'], axis=1, inplace=False))

Unnamed: 0_level_0,Unnamed: 1_level_0,Prop_in_State_Sim,Prop_in_State_Data
Parent_State,Child_State,Unnamed: 2_level_1,Unnamed: 3_level_1
"Low-Skilled, Everywhere Else","Low-Skilled, Everywhere Else",0.16091,0.177187
"Low-Skilled, Everywhere Else","Low-Skilled, Java",0.00136,0.001661
"Low-Skilled, Everywhere Else","High-Skilled, Everywhere Else",0.18036,0.164175
"Low-Skilled, Everywhere Else","High-Skilled, Java",0.00134,0.001661
"Low-Skilled, Java","Low-Skilled, Everywhere Else",0.00184,0.000277
"Low-Skilled, Java","Low-Skilled, Java",0.2041,0.230897
"Low-Skilled, Java","High-Skilled, Everywhere Else",0.0022,0.000831
"Low-Skilled, Java","High-Skilled, Java",0.22058,0.196013
"High-Skilled, Everywhere Else","Low-Skilled, Everywhere Else",0.04212,0.022702
"High-Skilled, Everywhere Else","Low-Skilled, Java",0.00031,0.000277


#### Using Pandas Crosstabs
Repeat the above exercises but using cross tabs to create a comparison table

In [40]:
Data_Cross = pd.crosstab(Data.Parent_State,Data.Child_State,normalize='index').rename(index=index, columns=index)

In [41]:
Sim_Cross = pd.crosstab(Data_Sim.Parent_State,Data_Sim.Child_State,normalize='index').rename(index=index, columns=index)

In [42]:
Uncond_State_Tran = ( Sim_Cross.merge(Data_Cross,left_index=True, right_index=True,suffixes=('_Sim','_Data'))
                             .sort_index(axis=1, ascending=False)
                             .sort_index(axis=0, ascending=False) )
del Data_Cross, Sim_Cross

Uncond_State_Tran

Child_State,"Low-Skilled, Java_Sim","Low-Skilled, Java_Data","Low-Skilled, Everywhere Else_Sim","Low-Skilled, Everywhere Else_Data","High-Skilled, Java_Sim","High-Skilled, Java_Data","High-Skilled, Everywhere Else_Sim","High-Skilled, Everywhere Else_Data"
Parent_State,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
"Low-Skilled, Java",0.476068,0.539457,0.004292,0.000647,0.514508,0.457956,0.005132,0.00194
"Low-Skilled, Everywhere Else",0.003954,0.004819,0.467802,0.514056,0.003896,0.004819,0.524348,0.476305
"High-Skilled, Java",0.477293,0.300601,0.003421,0.004008,0.515211,0.685371,0.004076,0.01002
"High-Skilled, Everywhere Else",0.003448,0.003106,0.468468,0.254658,0.005672,0.018634,0.522411,0.723602


The above DataFrame is the persistency between generations. We see that the model generally does will with the low-skilled cohorts in Generation G (the rows), but does not model well the incentive to invest in education among the high skilled cohorts. 

#### Conditional Transitions: Conditional on Migration (or not Migrating), the Relative Frequencies of Transitions from Parental State to Children entering into the Low or High Skilled States

##### Data

In [43]:
#Data: State Transitions Conidtional on Migration, the transition between states
((Data[Data.InterMarket_FamilyMig==1]
 .groupby(['Parent_State','Child_State'],as_index=True)
 .agg({'Household':'count'})
 .rename(index=index))
 .Household) / \
np.repeat(Data[Data.InterMarket_FamilyMig==1]
          .groupby(['Parent_State'],as_index=False)
          .agg({'Household':'count'})
          .Household
          .get_values(),[2]*4,axis=0)

Parent_State                   Child_State                  
Low-Skilled, Everywhere Else   Low-Skilled, Java                0.500000
                               High-Skilled, Java               0.500000
Low-Skilled, Java              Low-Skilled, Everywhere Else     0.250000
                               High-Skilled, Everywhere Else    0.750000
High-Skilled, Everywhere Else  Low-Skilled, Java                0.142857
                               High-Skilled, Java               0.857143
High-Skilled, Java             Low-Skilled, Everywhere Else     0.285714
                               High-Skilled, Everywhere Else    0.714286
Name: Household, dtype: float64

In [44]:
#Data: State Transtions Conditional on staying 
((Data[Data.InterMarket_FamilyMig==0]
 .groupby(['Parent_State','Child_State'],as_index=True)
 .agg({'Household':'count'})
 .rename(index=index))
 .Household) / \
np.repeat(Data[Data.InterMarket_FamilyMig==0]
          .groupby(['Parent_State'],as_index=False)
          .agg({'Household':'count'})
          .Household
          .get_values(),[2]*4,axis=0)

Parent_State                   Child_State                  
Low-Skilled, Everywhere Else   Low-Skilled, Everywhere Else     0.519059
                               High-Skilled, Everywhere Else    0.480941
Low-Skilled, Java              Low-Skilled, Java                0.540856
                               High-Skilled, Java               0.459144
High-Skilled, Everywhere Else  Low-Skilled, Everywhere Else     0.260317
                               High-Skilled, Everywhere Else    0.739683
High-Skilled, Java             Low-Skilled, Java                0.304878
                               High-Skilled, Java               0.695122
Name: Household, dtype: float64

Repeat the above using pandas crosstabs

In [45]:
Data_Cross_Mig = ( pd.crosstab(Data[Data.InterMarket_FamilyMig==1].Parent_State,Data.Child_State,normalize='index')
                     .rename(index=index, columns=index) )

In [46]:
Data_Cross_NoMig = ( pd.crosstab(Data[Data.InterMarket_FamilyMig==0].Parent_State,Data.Child_State,normalize='index')
                     .rename(index=index, columns=index) )

##### Simulation

In [47]:
#Simulation: State Transitions Conidtional on Migration, the transition between states
((Data_Sim[Data_Sim.Migrate==1]
  .groupby(['Parent_State','Child_State'],as_index=True)
  .agg({'Household':'count'})
  .rename(index=index))
  .Household) / \
np.repeat(Data_Sim[Data_Sim.Migrate==1]
          .groupby(['Parent_State'],as_index=False)
          .agg({'Household':'count'})
          .Household
          .get_values(),[2]*4,axis=0)

Parent_State                   Child_State                  
Low-Skilled, Everywhere Else   Low-Skilled, Java                0.503704
                               High-Skilled, Java               0.496296
Low-Skilled, Java              Low-Skilled, Everywhere Else     0.455446
                               High-Skilled, Everywhere Else    0.544554
High-Skilled, Everywhere Else  Low-Skilled, Java                0.378049
                               High-Skilled, Java               0.621951
High-Skilled, Java             Low-Skilled, Everywhere Else     0.456311
                               High-Skilled, Everywhere Else    0.543689
Name: Household, dtype: float64

In [48]:
#Simulation: State Transitions Conidtional on Migration, the transition between states
((Data_Sim[Data_Sim.Migrate==0]
  .groupby(['Parent_State','Child_State'],as_index=True)
  .agg({'Household':'count'})
  .rename(index=index))
  .Household) / \
np.repeat(Data_Sim[Data_Sim.Migrate==0]
          .groupby(['Parent_State'],as_index=False)
          .agg({'Household':'count'})
          .Household
          .get_values(),[2]*4,axis=0)

Parent_State                   Child_State                  
Low-Skilled, Everywhere Else   Low-Skilled, Everywhere Else     0.471504
                               High-Skilled, Everywhere Else    0.528496
Low-Skilled, Java              Low-Skilled, Java                0.480597
                               High-Skilled, Java               0.519403
High-Skilled, Everywhere Else  Low-Skilled, Everywhere Else     0.472780
                               High-Skilled, Everywhere Else    0.527220
High-Skilled, Java             Low-Skilled, Java                0.480898
                               High-Skilled, Java               0.519102
Name: Household, dtype: float64

Repeat the above using pandas crosstabs

In [49]:
Sim_Cross_Mig = ( pd.crosstab(Data_Sim[Data_Sim.Migrate==1].Parent_State,Data_Sim.Child_State,normalize='index')
                     .rename(index=index, columns=index) )

In [50]:
Sim_Cross_NoMig = ( pd.crosstab(Data_Sim[Data_Sim.Migrate==0].Parent_State,Data_Sim.Child_State,normalize='index')
                     .rename(index=index, columns=index) )

#### Comparisons

Comparison table: Add the two cross tabs to compare between the parents who chose to migrate and those who chose to stay in their current location --> the frequency of the children in the new states

In [51]:
Data_Cross_Mig + Data_Cross_NoMig

Child_State,"Low-Skilled, Everywhere Else","Low-Skilled, Java","High-Skilled, Everywhere Else","High-Skilled, Java"
Parent_State,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
"Low-Skilled, Everywhere Else",0.519059,0.5,0.480941,0.5
"Low-Skilled, Java",0.25,0.540856,0.75,0.459144
"High-Skilled, Everywhere Else",0.260317,0.142857,0.739683,0.857143
"High-Skilled, Java",0.285714,0.304878,0.714286,0.695122


The above Data State Transitinos show that in the data, we observe more parents educating their children conditional on having chosen to migrate, as there are RELATIVELY more children in the high skilled states in the new location than their counterparts whose parents chose to stay. 

In [52]:
Sim_Cross_Mig + Sim_Cross_NoMig

Child_State,"Low-Skilled, Everywhere Else","Low-Skilled, Java","High-Skilled, Everywhere Else","High-Skilled, Java"
Parent_State,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
"Low-Skilled, Everywhere Else",0.471504,0.503704,0.528496,0.496296
"Low-Skilled, Java",0.455446,0.480597,0.544554,0.519403
"High-Skilled, Everywhere Else",0.47278,0.378049,0.52722,0.621951
"High-Skilled, Java",0.456311,0.480898,0.543689,0.519102


The simulated results are not that close to the data, but the basic pattern is captured wherein if the parent chooses to migrate we see a corresponding increase in the proportion of children who are high skilled in the new location relative to those who chose to stay. Put differently, among parents who moved we also observe higher rates of high-skilled children relative to the cohort of children whose parents didn't move. 

The troublesome part is that the low-skilled parents located elsewhere defy the pattern of the data: low-skilled parents located elsewhere who chose to move (to Java) also choose to decrease the rates at which they educate their children in the new location, so that we see relatively more high-skilled children in the old location vs. the new location (Java). This goes counter to the data and to the overall pattern both in the data and the rest of the simulated outcomes.  