Model specification
Two main components of models are distinguished in SEM: the *structural model* showing potential causal dependencies
between endogenous and exogenous variables, and the *measurement model* showing the relations between latent variables
and their indicators. Exploratory and confirmatory factor analysis models, for example, contain only the measurement part,
while path diagrams can be viewed as SEMs that contain only the structural part.

In [1]:
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis

In [2]:
df = pd.read_csv('retail_may.csv')

In [3]:
sample = df.sample(n=1000)

In [11]:
sample.var()

Quantity               3.867069e+02
UnitPrice              3.895306e+04
CustomerID             2.948601e+06
AcceptDiscount         2.500541e-01
NormalPrice            3.316560e+01
LenghtDiscount         6.783151e+00
PromotionDays          8.604916e+00
Elasticity             1.289273e+00
KindOfSale             6.669379e-01
Display                2.502012e-01
DisplayInStore         2.502342e-01
TypeofPromotion        1.230747e+00
TypeofProduct          8.128673e+00
CompetitionPressure    2.501862e-01
Seasonality            2.499610e-01
Revenue                4.067894e+04
Clustering             2.118294e+00
dtype: float64

In [33]:
df.corr()

Unnamed: 0,Quantity,UnitPrice,CustomerID,AcceptDiscount,NormalPrice,LenghtDiscount,PromotionDays,Elasticity,KindOfSale,Display,DisplayInStore,TypeofPromotion,TypeofProduct,CompetitionPressure,Seasonality,Revenue,Clustering
Quantity,1.0,-0.001512,-0.004612,-0.002522,0.001144,0.001361,-7.6e-05,0.000227,-0.000172,0.002672,-0.00024,0.001826,0.000161,0.000635,0.002169,0.78703,-0.00053
UnitPrice,-0.001512,1.0,-0.002132,-7.6e-05,0.00063,-2.1e-05,-0.000797,6.7e-05,0.000839,-0.002557,-4e-05,-0.002031,-0.000789,0.001478,-0.000447,-0.294454,-0.003904
CustomerID,-0.004612,-0.002132,1.0,0.002601,-0.001197,0.003026,-0.001908,0.000759,-0.000307,0.000846,0.001,-0.000126,-0.000141,-0.00262,0.00168,-0.003519,-0.00281
AcceptDiscount,-0.002522,-7.6e-05,0.002601,1.0,-0.001518,-0.001547,-0.001945,0.002362,0.001868,0.000141,0.00281,0.000792,0.003488,-0.002771,-0.000518,-0.00202,0.069982
NormalPrice,0.001144,0.00063,-0.001197,-0.001518,1.0,-0.000365,-0.001701,0.001022,0.000157,-0.001041,0.001723,0.002671,-0.003759,0.002485,0.001736,0.001991,0.001393
LenghtDiscount,0.001361,-2.1e-05,0.003026,-0.001547,-0.000365,1.0,-0.002573,0.002105,-0.002654,-0.002052,0.000591,0.002679,0.00191,-0.000765,0.00272,0.000982,0.004726
PromotionDays,-7.6e-05,-0.000797,-0.001908,-0.001945,-0.001701,-0.002573,1.0,0.002284,-0.000786,0.001688,0.002598,0.001237,0.002605,0.0011,0.000168,0.000352,0.000704
Elasticity,0.000227,6.7e-05,0.000759,0.002362,0.001022,0.002105,0.002284,1.0,-0.000249,-0.000853,-0.00088,-0.001909,-0.000235,-0.003445,0.000576,0.000491,0.104851
KindOfSale,-0.000172,0.000839,-0.000307,0.001868,0.000157,-0.002654,-0.000786,-0.000249,1.0,0.001997,-0.002015,0.001403,0.001039,-0.000887,0.002217,0.001637,0.089412
Display,0.002672,-0.002557,0.000846,0.000141,-0.001041,-0.002052,0.001688,-0.000853,0.001997,1.0,0.001166,-0.000492,0.000885,0.000739,-0.00234,0.003612,-0.144741


In [25]:
# Synthetic stand-in data: 100 rows of random integer scores in [0, 100) for four features
columns_names = ['Casual', 'Party', 'Sports', 'Luxury']
xx = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=columns_names)


In [29]:
factor = FactorAnalysis().fit(xx)
results = pd.DataFrame(factor.components_, columns=columns_names)

Once the feature matrix is in place, the FactorAnalysis class is initialized with its default settings. Because n_components is left as None, the model keeps as many factors as there are features (four here).
The model is then fitted to the data.
You can explore the results through the components_ attribute, an array of loadings that measure the relationship between the newly **created factors, placed in rows, and the original features, placed in columns**.

At the intersection of each factor and feature, a **positive number indicates that the feature moves in the same direction as the factor**; a **negative number indicates that the two move in opposite directions**. The larger the absolute value, the stronger the relationship.
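A minimal sketch of reading the loadings, assuming seeded random data in place of the notebook's unseeded `xx`. The names `rng`, `X`, `fa`, `loadings`, and `signs` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis

# Hypothetical stand-in data, seeded so the run is repeatable
rng = np.random.default_rng(0)
feature_names = ['Casual', 'Party', 'Sports', 'Luxury']
X = pd.DataFrame(rng.integers(0, 100, size=(100, 4)), columns=feature_names)

# n_components is left as None, so one factor per feature is kept
fa = FactorAnalysis(random_state=0).fit(X)

# Rows are factors, columns are the original features
loadings = pd.DataFrame(fa.components_, columns=feature_names)
loadings.index.name = 'factor'

# +1 where factor and feature move together, -1 where they move apart
signs = np.sign(loadings)
print(loadings.round(3))
print(signs)
```

Because the data here is pure noise, the loadings carry no substantive meaning; the point is only the layout of the array and how to read its signs.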

In [30]:
# created factors, placed in rows, and the original features, placed in columns
results

      Casual      Party     Sports     Luxury
0  20.728071 -13.897304 -21.984670  -1.674322
1   2.090272  -6.622537   8.436444 -29.928374
2  16.824917  -7.394453  19.896295   8.419858
3  13.609350  22.721985  -1.195721  -4.414459


In [35]:
# Average log-likelihood of the samples under the fitted model.
factor.score(xx)

-19.281376465335935
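The value returned by `score` is the mean of the per-sample log-likelihoods from `score_samples`. The sketch below checks that relationship on seeded random data; the two-factor setting and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Hypothetical stand-in data, seeded for repeatability
rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(100, 4)).astype(float)

fa = FactorAnalysis(n_components=2, random_state=0).fit(X)

per_sample = fa.score_samples(X)  # log-likelihood of each row
average = fa.score(X)             # mean of the per-sample values

print(average, per_sample.mean())
```

A higher (less negative) score means the fitted model assigns more probability to the data, which makes it usable for comparing different numbers of factors on held-out samples.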

In [41]:
# Compute data precision matrix with the FactorAnalysis model.
factor.get_precision()

array([[ 1.14453743e-03,  1.52257007e-04,  1.32273195e-04,
         1.47969126e-05],
       [ 1.52257007e-04,  1.27170934e-03, -8.49446973e-05,
        -7.69425331e-05],
       [ 1.32273195e-04, -8.49446973e-05,  1.07528591e-03,
         5.37257701e-05],
       [ 1.47969126e-05, -7.69425331e-05,  5.37257701e-05,
         1.01735086e-03]])

In [39]:
# Get parameters for this estimator.
factor.get_params()

{'copy': True,
 'iterated_power': 3,
 'max_iter': 1000,
 'n_components': None,
 'noise_variance_init': None,
 'random_state': 0,
 'svd_method': 'randomized',
 'tol': 0.01}

In [37]:
# Compute data covariance with the FactorAnalysis model.
factor.get_covariance()

array([[ 903.3144, -117.0868, -119.5848,  -15.6784],
       [-117.0868,  808.9596,   75.3656,   58.9048],
       [-119.5848,   75.3656,  952.7916,  -42.8772],
       [ -15.6784,   58.9048,  -42.8772,  989.8924]])
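The covariance and precision matrices shown above are two views of the same fitted model: the model covariance is the loadings' outer product plus the diagonal noise variances, and the precision matrix is its inverse. A short sketch on seeded random data (an assumption, as are the two-factor setting and variable names) verifies both relationships.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Hypothetical stand-in data, seeded for repeatability
rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(100, 4)).astype(float)

fa = FactorAnalysis(n_components=2, random_state=0).fit(X)

cov = fa.get_covariance()
prec = fa.get_precision()

# Model covariance rebuilt by hand: W^T W + diag(noise variances)
manual_cov = fa.components_.T @ fa.components_ + np.diag(fa.noise_variance_)

print(np.allclose(cov, manual_cov))
print(np.allclose(prec @ cov, np.eye(4), atol=1e-6))
```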

In [None]:
# structural model: dependencies between endogenous and exogenous variables
# y1 = b1*x1 + b2*x2 + b3*x3 + b4*x4 + U
# where y1 is endogenous, x1..x4 are exogenous, b1..b4 are path coefficients, and U is the error term
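For a single endogenous variable, the structural equation above reduces to a multiple regression. The following sketch estimates the coefficients by ordinary least squares on simulated data; the coefficient values, sample size, and noise scale are hypothetical choices made only so the result can be checked.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

X = rng.normal(size=(n, 4))                   # exogenous x1..x4
true_coefs = np.array([1.0, -2.0, 0.5, 3.0])  # hypothetical path coefficients
U = rng.normal(scale=0.1, size=n)             # error term
y1 = X @ true_coefs + U                       # endogenous variable

# Ordinary least squares with an intercept column
design = np.column_stack([np.ones(n), X])
estimates, *_ = np.linalg.lstsq(design, y1, rcond=None)
intercept, coefs = estimates[0], estimates[1:]

print(intercept, coefs)
```

A full SEM with several endogenous variables or latent factors needs a dedicated estimator; this only illustrates the one-equation case.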


In [None]:
# step 1: specification - set up the endogenous (y) variables
# step 2: specification - set up the exogenous (x) variables related to each endogenous (y) variable
# step 3: specification - set up the relations between endogenous variables
# step 4: estimation - confirmatory analysis between endogenous and exogenous variables (multiple regression)
# step 5: model fit - the estimated model parameters are used to predict the correlations or covariances
# between measured variables, and the predicted correlations or covariances are compared to the observed correlations
# or covariances (see measures of model fit). If the fit of the model is poor, the model needs to be re-specified and the researcher returns to step 1.
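The model-fit step can be sketched with the measurement model already used in this notebook: fit a factor model, take its implied covariance matrix, and compare it with the observed one. The simulated one-factor data, the loadings, and the root-mean-square summary below are assumptions for illustration, not a standard SEM fit index.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Simulate four indicators driven by one latent factor (hypothetical loadings)
rng = np.random.default_rng(0)
n = 1000
latent = rng.normal(size=(n, 1))
true_loadings = np.array([[0.9, 0.8, 0.7, 0.6]])
X = latent @ true_loadings + rng.normal(scale=0.5, size=(n, 4))

fa = FactorAnalysis(n_components=1, random_state=0).fit(X)

observed = np.cov(X, rowvar=False, bias=True)  # observed covariances
implied = fa.get_covariance()                  # covariances predicted by the model
residual = observed - implied

# Ad hoc summary: root mean square of the residual entries
rmsr = np.sqrt(np.mean(residual ** 2))
print(residual.round(3))
print(rmsr)
```

Large residuals for particular pairs of variables point to the relations the specified model fails to reproduce, which is where re-specification would start.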

In [24]:
labels = ['y1', 'x11', 'x12']

In [26]:
df = pd.DataFrame(data=variables)

In [21]:
df.transpose()

    0   1   2
0  15  13   9
1  20  11  15
2  25  19  21
3  17  44  22


In [29]:
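# df has one row per variable, so this correlates the four observations (columns 0-3), not the three variables
df.corr()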

          0         1         2         3
0  1.000000  0.387147  0.500000  0.015192
1  0.387147  1.000000  0.992065 -0.916030
2  0.500000  0.992065  1.000000 -0.858330
3  0.015192 -0.916030 -0.858330  1.000000


In [14]:
import itertools

# Print the 2x2 correlation matrix for every pair of variables
variables = [y1, x11, x12]
combinations = list(itertools.combinations(variables, 2))
for x, z in combinations:
    print(np.corrcoef(x, z))

[[ 1.         -0.18508027]
 [-0.18508027  1.        ]]
[[1.         0.56326882]
 [0.56326882 1.        ]]
[[1.         0.69759998]
 [0.69759998 1.        ]]


In [31]:
a = [np.corrcoef(x, z) for x, z in combinations]

In [66]:
# Print each label pair next to its off-diagonal correlation coefficient
for pair, corr in zip(itertools.combinations(labels, 2), a):
    print(pair, corr[0][1])

('y1', 'x11') -0.18508027356882345
('y1', 'x12') 0.5632688173469869
('x11', 'x12') 0.6975999841917071
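The same labeled pairwise correlations can be obtained in one step by giving each variable its own named column. The values below are read off the transposed table shown earlier and are assumed to correspond to `y1`, `x11`, and `x12` in that order; the notebook never shows the original definitions.

```python
import pandas as pd

# Values taken from the transposed table; the column-to-variable mapping is an assumption
data = pd.DataFrame({
    'y1':  [15, 20, 25, 17],
    'x11': [13, 11, 19, 44],
    'x12': [9, 15, 21, 22],
})

# One column per variable, so corr() is variable-by-variable
corr = data.corr()
print(corr)
```

With variables in columns, `corr()` returns a 3x3 matrix whose off-diagonal entries are the three coefficients printed above.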


In [52]:
for corr in a:
    print(corr[0][1])

-0.18508027356882345
0.5632688173469869
0.6975999841917071


In [41]:
a[1][0][1]

0.5632688173469869

In [39]:
a[0][0][1]

-0.18508027356882345

In [36]:
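# nditer treats the three 2x2 matrices as parallel operands, so each step yields one entry from each matrix
for x in np.nditer(a):
    print(x)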

(array(1.), array(1.), array(1.))
(array(-0.18508027), array(0.56326882), array(0.69759998))
(array(-0.18508027), array(0.56326882), array(0.69759998))
(array(1.), array(1.), array(1.))


In [9]:
# 2x2 correlation matrix of y1 and x11
np.corrcoef(y1, x11)

array([[ 1.        , -0.18508027],
       [-0.18508027,  1.        ]])