# Workbook 4 Predicted probabilities, Model Building, and Model Fit

In [1]:
use "https://www.stata-press.com/data/r17/nhanes2.dta", clear
desc




Contains data from https://www.stata-press.com/data/r17/nhanes2.dta
  obs:        10,351                          
 vars:            58                          20 Dec 2020 10:07
 size:     1,107,557                          
--------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
sampl           long    %9.0g                 Unique case identifier
strata          byte    %9.0g                 Stratum identifier
psu             byte    %9.0g      psulbl     Primary sampling unit
region          byte    %9.0g      region     Region
smsa            byte    %22.0g     smsalbl    SMSA type
location        byte    %9.0g                 Location (stand office ID)
houssiz         byte    %9.0g                 Number of people in household
sex             byte    %9.0g      sex        S

In [2]:
tab highbp
tab region
*We can use the robust standard errors feature in logistic regression
logistic highbp i.female age i.region, vce(robust)



 High blood |
   pressure |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      5,975       57.72       57.72
          1 |      4,376       42.28      100.00
------------+-----------------------------------
      Total |     10,351      100.00


     Region |      Freq.     Percent        Cum.
------------+-----------------------------------
         NE |      2,096       20.25       20.25
         MW |      2,774       26.80       47.05
          S |      2,853       27.56       74.61
          W |      2,628       25.39      100.00
------------+-----------------------------------
      Total |     10,351      100.00


Logistic regression                             Number of obs     =     10,351
                                                Wald chi2(5)      =    1418.62
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -6267.0671               Pseudo R2         =     0.1112

-

Here, we model high blood pressure as a function of gender, age, and region. Based on the model, we find:
* Women have 35% lower odds of high blood pressure than men, and the difference is significant.
* Age has a positive, significant association with high blood pressure. Each additional year of age increases the odds of high blood pressure by about 5%.
* Generally, the Northeast region is associated with higher blood pressure. The Midwest has significantly lower odds (12% lower) of high blood pressure than the Northeast. The South has 7% lower odds and the West has 6% lower odds than the Northeast.
* For men at age zero in the Northeast, the odds of high blood pressure are .10 (the intercept).

## Predicted probabilities
In the last workbook, I stressed that we have <b>predicted probabilities</b> instead of <b>predicted values</b>. That is because logistic models use the logit transformation, so we estimate the probability of the outcome rather than its value.

Logistic regression model:

$Pr(outcome_i=1|x)=\pi(x)=\frac{e^{\beta_0+\beta_{1}x_{1}+...+\beta_{k}x_{k}}}{1+e^{\beta_0+\beta_{1}x_{1}+...+\beta_{k}x_{k}}}$

where the logit is $g(x) = ln{\frac{\pi(x)}{(1-\pi(x))}} = \beta_0 + \beta_{1}x_{1}+...+\beta_{k}x_{k}$ 

The betas are technically logits (log-odds), not odds ratios.

In [15]:
logit highbp i.female age i.region, vce(robust)


Iteration 0:   log pseudolikelihood = -7050.7655  
Iteration 1:   log pseudolikelihood = -6271.3989  
Iteration 2:   log pseudolikelihood = -6267.0675  
Iteration 3:   log pseudolikelihood = -6267.0671  

Logistic regression                             Number of obs     =     10,351
                                                Wald chi2(5)      =    1418.62
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -6267.0671               Pseudo R2         =     0.1112

------------------------------------------------------------------------------
             |               Robust
      highbp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |
     Female  |  -.4317336   .0431903   -10.00   0.000    -.5163851   -.3470822
         age |   .0478323   .0013109    36.49   0.000      .045263    .0504016
             |
      region |
 

<center>$g(x)=  ln{\frac{\pi(x)}{(1-\pi(x))}}  = \beta_0+\beta_1*(female_i)+\beta_2*(age_i)+\beta_3*(mw_i)+\beta_4*(south_i)+\beta_5*(west_i)$
    
<center> where $\pi(x)=Pr(highbp_i=1|x)$

Our model estimated this:

<center>$g(x)= -2.351029 -.4317336*(female_i)+ .0478323*(age_i)-.1234056*(mw_i)-.075136*(south_i)-.0599217*(west_i)$

#### Please note the equation has the logits and not odds ratios
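A minimal sketch of moving from the logit scale to odds ratios, assuming the coefficients printed in the ```logit``` output above. The scalar names are hypothetical and used only for illustration. Exponentiating a logit coefficient gives the odds ratio, and 100*(OR - 1) gives the percent change in the odds.

```stata
*coefficients copied from the logit output (logit scale)
scalar b_female = -.4317336
scalar b_age    =  .0478323
scalar b_cons   = -2.351029

*odds ratios are the exponentiated logits
scalar or_female = exp(b_female)
scalar or_age    = exp(b_age)
*the exponentiated intercept is the baseline odds, not an odds ratio
scalar odds_cons = exp(b_cons)

*percent change in the odds: 100*(OR - 1)
di "Female OR = " or_female "   percent change in odds = " 100*(or_female-1)
di "Age OR    = " or_age "   percent change in odds = " 100*(or_age-1)
di "Baseline odds (intercept) = " odds_cons
```

Stata's ```logistic``` command, or ```logit``` with the ```or``` option, reports these odds ratios directly.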

### Margins
We can use the ```margins``` command in **Stata** to calculate the predicted probabilities. You can also calculate them by hand, but with many independent variables it can get messy.

In [3]:
quietly logit highbp i.female age i.region, vce(robust)
*find means of variables
summ female age 
tab region




    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      female |     10,351    .5251667    .4993904          0          1
         age |     10,351    47.57965    17.21483         20         74


     Region |      Freq.     Percent        Cum.
------------+-----------------------------------
         NE |      2,096       20.25       20.25
         MW |      2,774       26.80       47.05
          S |      2,853       27.56       74.61
          W |      2,628       25.39      100.00
------------+-----------------------------------
      Total |     10,351      100.00


<center>
$\pi(x)=\frac{e^{\beta_0+\beta_1*x}}{1+e^{\beta_0+\beta_1*x}}$

In [4]:
di exp(-2.351029 -.4317336*.5251667 + .0478323*47.57965 -.1234056*.2680 -.075136*.2756 -.0599217*.2539)/(1+exp(-2.351029 -.4317336*.5251667 + .0478323*47.57965 -.1234056*.2680 -.075136*.2756 -.0599217*.2539))

.40832083


In [5]:
*margins with "atmeans" will calculate everything at the mean
margins, atmeans


Adjusted predictions                            Number of obs     =     10,351
Model VCE    : Robust

Expression   : Pr(highbp), predict()
at           : 0.female        =    .4748333 (mean)
               1.female        =    .5251667 (mean)
               age             =    47.57965 (mean)
               1.region        =    .2024925 (mean)
               2.region        =    .2679934 (mean)
               3.region        =    .2756255 (mean)
               4.region        =    .2538885 (mean)

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   .4083206   .0052204    78.22   0.000     .3980889    .4185524
------------------------------------------------------------------------------


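Taking the mean of a categorical variable does not describe any real person, so it usually makes more sense to set categorical variables to specific values.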

In [6]:
margins, at(female=1 age=40 region==1)
margins, at(female=0 age=40 region==1)



Adjusted predictions                            Number of obs     =     10,351
Model VCE    : Robust

Expression   : Pr(highbp), predict()
at           : female          =           1
               age             =          40
               region          =           1

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   .2953644   .0109949    26.86   0.000     .2738147     .316914
------------------------------------------------------------------------------


Adjusted predictions                            Number of obs     =     10,351
Model VCE    : Robust

Expression   : Pr(highbp), predict()
at           : female          =           0
               age             =          40
               region          =           1

------------

Based on our model, the predicted probability of having high blood pressure for a 40-year-old woman in the Northeast region is 30%. The predicted probability for a 40-year-old man in the Northeast region is 39%. That is a gap of about 9 to 10 percentage points, which demonstrates a gender gap in high blood pressure based on our model.
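These two predictions can be checked by hand. The sketch below assumes the coefficients from the ```logit``` output earlier in this workbook and uses Stata's ```invlogit()``` function, which computes $e^{x}/(1+e^{x})$. The scalar names are hypothetical.

```stata
*linear predictor (logit) for a 40-year-old man in the Northeast (reference region)
scalar xb_male   = -2.351029 + .0478323*40
*add the female coefficient for a 40-year-old woman in the Northeast
scalar xb_female = xb_male - .4317336

*convert logits to probabilities
scalar p_male   = invlogit(xb_male)
scalar p_female = invlogit(xb_female)

di "Pr(highbp) woman, age 40, NE = " p_female
di "Pr(highbp) man, age 40, NE   = " p_male
di "Gap (man - woman)            = " p_male-p_female
```

The value for women should agree with the ```margins``` output (.2953644) up to rounding of the coefficients.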

Using predicted probabilities is an accessible way to present findings.

In [7]:
describe


Contains data from https://www.stata-press.com/data/r17/nhanes2.dta
  obs:        10,351                          
 vars:            58                          20 Dec 2020 10:07
 size:     1,107,557                          
--------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
sampl           long    %9.0g                 Unique case identifier
strata          byte    %9.0g                 Stratum identifier
psu             byte    %9.0g      psulbl     Primary sampling unit
region          byte    %9.0g      region     Region
smsa            byte    %22.0g     smsalbl    SMSA type
location        byte    %9.0g                 Location (stand office ID)
houssiz         byte    %9.0g                 Number of people in household
sex             byte    %9.0g      sex        Sex

In [8]:
logit highbp i.female age i.region, vce(robust)
*fitstat is a user-written command (SPost package); it must be installed first
fitstat



Iteration 0:   log pseudolikelihood = -7050.7655  
Iteration 1:   log pseudolikelihood = -6271.3989  
Iteration 2:   log pseudolikelihood = -6267.0675  
Iteration 3:   log pseudolikelihood = -6267.0671  

Logistic regression                             Number of obs     =     10,351
                                                Wald chi2(5)      =    1418.62
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -6267.0671               Pseudo R2         =     0.1112

------------------------------------------------------------------------------
             |               Robust
      highbp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |
     Female  |  -.4317336   .0431903   -10.00   0.000    -.5163851   -.3470822
         age |   .0478323   .0013109    36.49   0.000      .045263    .0504016
             |
      region |


## Comparing models
We often estimate multiple models in our research. For example, when I publish a paper, I will include at least three models. How can we assess model fit? How do we build models? 

#### Ways to assess model fit
In OLS, you can compare R-squared. For models estimated with MLE, we can use the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). These model fit statistics are built from the log likelihood plus a penalty for the number of parameters. Generally, smaller AIC and BIC values are better. But these criteria are relative: they only rank models fit to the same outcome and the same sample.

For nested models, you can also compare fit with a likelihood-ratio test.

#### Steps I usually follow for model building:
* Null models: Generally, I will start with a null model (without any independent variables). Null models can be estimated for continuous dependent variables, and for categorical outcomes as well: an intercept-only ```logit``` is the null model, and its log likelihood is what Stata reports as iteration 0.

* Slowly building models: You can slowly start to add independent variables to the model. I usually group similar variables together, like a racial/ethnic model or a socio-economic model. When I have a set of variables aligned with a theory, I will make different theory models, for example a treadmill of production model and a treadmill of destruction model. I'll make notes of significance and model fit statistics.

* Fit a full/saturated model: This is the model with all the independent variables included. I'll make notes of significance and model fit statistics.

#### Let's practice with an example
This is a dataset from Hosmer et al. from the low birth weight study at Baystate Medical Center.

In [9]:
import delimited "C:\Users\cam\Downloads\alr\logistic\lowbwt11.dat", delimiters(" ") clear

gen pair=v12
replace pair=v11 if v11!=.
label variable pair "pair"

gen low=v18
replace low=v17 if v17!=.
label variable low "low birth weight (1 = <2500g)"

gen age=v23
replace age=v22 if v22!=.
label variable age "age"

gen lwt=v26
replace lwt=v27 if v27!=.
replace lwt=v28 if v28!=.
label variable lwt "weight of mother at last menstrual period"

gen race=v33
replace race=v34 if v34!=.
replace race=v35 if v35!=.
label variable race "race 1=wht 2=blk 3=other"

gen smoke=v41
replace smoke=v42 if v42!=.
replace smoke=v43 if v43!=.
label variable smoke "smoking status"

gen ptd=v47
replace ptd=v48 if v48!=.
replace ptd=v49 if v49!=.
label variable ptd "history of premature labor"

gen ht=v52
replace ht=v53 if v53!=.
replace ht=v54 if v54!=.
label variable ht "history of hypertension"

gen ui=v57
replace ui=v58 if v58!=.
replace ui=v59 if v59!=.
label variable ui "presence of uterine irritability"

drop v1-v59

summ pair low age lwt smoke ptd ht ui
tab race


(59 vars, 112 obs)

(94 missing values generated)

(94 real changes made)


(94 missing values generated)

(94 real changes made)


(94 missing values generated)

(94 real changes made)


(31 missing values generated)

(29 real changes made)

(2 real changes made)


(31 missing values generated)

(29 real changes made)

(2 real changes made)


(31 missing values generated)

(29 real changes made)

(2 real changes made)


(31 missing values generated)

(29 real changes made)

(2 real changes made)


(31 missing values generated)

(29 real changes made)

(2 real changes made)


(31 missing values generated)

(29 real changes made)

(2 real changes made)




    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        pair |        112        28.5    16.23587          1         56
         low |        112          .5    .5022472          0          1
         age |        112    22.50893    4.3412

Given the variables, I am going to group them by demographics, health status, and activity. I will run each group of variables as its own model and then I will run the full/saturated model.

<b>Demographic model</b>
<center>$g(x)=  ln{\frac{\pi(x)}{(1-\pi(x))}}  = \beta_0+\beta_1*(black_i)+\beta_2*(other_i)+\beta_3*(age_i)$
    
<center> where $\pi(x)=Pr(lbw_i=1|x)$
    
<b>Health status model</b>
<center>$g(x)=  ln{\frac{\pi(x)}{(1-\pi(x))}}  = \beta_0+\beta_1*(hist\_premature\_labor_i)+\beta_2*(history\_hypertension_i)+\beta_3*(pres\_uterine\_irr_i)$
    
<center> where $\pi(x)=Pr(lbw_i=1|x)$

<b>Activity model</b>
<center>$g(x)=  ln{\frac{\pi(x)}{(1-\pi(x))}}  = \beta_0+\beta_1*(weight_i)+\beta_2*(smoke_i)$
    
<center> where $\pi(x)=Pr(lbw_i=1|x)$
    
<b>Full model</b>
<center>$g(x)=  ln{\frac{\pi(x)}{(1-\pi(x))}}  = \beta_0+ \beta_1*(black_i)+ \beta_2*(other_i)+\beta_3*(age_i)+ \beta_4*(hist\_premature\_labor_i)+ \beta_5*(history\_hypertension_i)+ \beta_6*(pres\_uterine\_irr_i)+ \beta_7*(weight_i)+\beta_8*(smoke_i)$
    
<center> where $\pi(x)=Pr(lbw_i=1|x)$

#### estimates command
The ```estimates store model_name``` command saves model results for later comparison. It is a postestimation command, meaning it has to be run after estimating a model. 

```estimates table model_name, stats(aic bic)``` will show you the saved information.

In [12]:
*demographic model
logit low i.race age
estimates store demo



Iteration 0:   log likelihood = -77.632484  
Iteration 1:   log likelihood = -77.597923  
Iteration 2:   log likelihood = -77.597923  

Logistic regression                             Number of obs     =        112
                                                LR chi2(3)        =       0.07
                                                Prob > chi2       =     0.9953
Log likelihood = -77.597923                     Pseudo R2         =     0.0004

------------------------------------------------------------------------------
         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        race |
          2  |   .0942566   .5358725     0.18   0.860    -.9560342    1.144547
          3  |  -.0433902   .4235261    -0.10   0.918    -.8734862    .7867058
             |
         age |  -.0006389    .044323    -0.01   0.988    -.0875104    .0862326
       _cons |   .0149276   1.078548     

For the demographics model, we find none of the variables are significant.
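As a check on where the header statistics come from, the sketch below rebuilds the LR chi-square and the pseudo R-squared from the two log likelihoods in the output above. It assumes that iteration 0 of Stata's ```logit``` is the intercept-only log likelihood and that the reported pseudo R-squared is McFadden's. The scalar names are hypothetical.

```stata
*log likelihoods copied from the demographic model output
*iteration 0 is the intercept-only (null) model
scalar ll_null = -77.632484
*final iteration is the demographic model
scalar ll_demo = -77.597923

*LR chi2 compares the fitted model with the intercept-only model
scalar lr_chi2 = 2*(ll_demo - ll_null)

*McFadden's pseudo R2
scalar r2_p = 1 - ll_demo/ll_null

*with half of the 112 births coded low, the null log likelihood is 112*ln(.5)
scalar ll_check = 112*ln(.5)

di "LR chi2(3) = " lr_chi2
di "Pseudo R2  = " r2_p
di "Null log likelihood by hand = " ll_check
```

Both statistics are close to zero, which agrees with the output: the demographic variables add almost nothing beyond the intercept.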

In [13]:
*health model
logit low i.ptd i.ht i.ui
estimates store health



Iteration 0:   log likelihood = -77.632484  
Iteration 1:   log likelihood = -71.573504  
Iteration 2:   log likelihood = -71.563778  
Iteration 3:   log likelihood = -71.563777  

Logistic regression                             Number of obs     =        112
                                                LR chi2(3)        =      12.14
                                                Prob > chi2       =     0.0069
Log likelihood = -71.563777                     Pseudo R2         =     0.0782

------------------------------------------------------------------------------
         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       1.ptd |   1.142014    .507797     2.25   0.025     .1467501    2.137278
        1.ht |   1.185478   .7380595     1.61   0.108    -.2610916    2.632048
        1.ui |   1.031845   .5509494     1.87   0.061    -.0479955    2.111686
       _cons |  -.5210001 

For the health status model, we find: 1) history of premature labor significantly increases the likelihood of low birth weight; 2) history of hypertension has a positive coefficient but falls just short of marginal significance (p=.108); 3) presence of uterine irritability marginally (p=.061) increases the likelihood of low birth weight.

In [14]:
*activity model
logit low lwt i.smoke
estimates store activity



Iteration 0:   log likelihood = -77.632484  
Iteration 1:   log likelihood = -72.365234  
Iteration 2:   log likelihood = -72.350205  
Iteration 3:   log likelihood = -72.350204  

Logistic regression                             Number of obs     =        112
                                                LR chi2(2)        =      10.56
                                                Prob > chi2       =     0.0051
Log likelihood = -72.350204                     Pseudo R2         =     0.0680

------------------------------------------------------------------------------
         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lwt |  -.0119339   .0068241    -1.75   0.080    -.0253088    .0014411
     1.smoke |   1.094499   .4073455     2.69   0.007     .2961168    1.892882
       _cons |   1.072159   .8880152     1.21   0.227    -.6683192    2.812636
--------------------------

Mothers with higher weight are marginally (p=.080) less likely to have children with low birth weight. We find that mothers who smoke are significantly more likely to have a child with low birth weight.

In [15]:
logit low i.race age ptd ht ui lwt i.smoke
estimates store full



Iteration 0:   log likelihood = -77.632484  
Iteration 1:   log likelihood = -65.552263  
Iteration 2:   log likelihood = -65.494219  
Iteration 3:   log likelihood = -65.494139  
Iteration 4:   log likelihood = -65.494139  

Logistic regression                             Number of obs     =        112
                                                LR chi2(8)        =      24.28
                                                Prob > chi2       =     0.0021
Log likelihood = -65.494139                     Pseudo R2         =     0.1564

------------------------------------------------------------------------------
         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        race |
          2  |   .7244131   .6219839     1.16   0.244    -.4946529    1.943479
          3  |   .3722791   .5922604     0.63   0.530      -.78853    1.533088
             |
         age |   .0050955   .0

In [None]:
estimates table demo health activity full, stats(r2_p aic bic)

*save to a Word document (putdocx writes a .docx file)
cd "D:\documents copy\teaching\SOCG 206 spring 2025\jupyter\results"
putdocx begin
putdocx table results = etable
putdocx save myresults.docx

|  | Demographics | Health status | Activity | Full |
| --- | --- | --- | --- | --- |
| White | ref | | | ref |
| Black | .094 (.536) | | | .724 (.622) |
| Other | -.043 (.424) | | | .372 (.592) |
| Age | -.001 (.044) | | | .005 (.0515) |
| Hist. premature labor | | 1.142 (.508)* | | .983 (.543) |
| Hist. hypertension | | 1.185 (.738) | | 1.777 (.853)* |
| Pres. uterine irritability | | 1.032 (.551) | | 1.114 (.589) |
| Weight | | | -.0119 (.007) | -.0169 (.0082)* |
| Smoker | | | 1.094 (.407)** | 1.220 (.526)* |
| Intercept | .015 (1.079) | -.521 (.249)* | 1.072 (.888) | .691 (1.574) |
| N | 112 | 112 | 112 | 112 |
| AIC | 163.19585 | 151.12755 | 150.70041 | 148.98828 |
| BIC | 174.06984 | 162.00155 | 158.85591 | 173.45477 |

Coefficients are logits with standard errors in parentheses; stars follow the p-values in the output (* p<.05, ** p<.01).

Notice any changes from the single-group models? The significance of history of premature labor, hypertension, and weight changes in the full model: premature labor is no longer significant, while hypertension and weight become significant. Based on the AIC, the full model fits relatively best. The BIC suggests the activity model fits relatively best. I am starting to see that the demographic variables are not as important.
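To see why AIC and BIC can disagree, here is a small sketch that recomputes both for the full model from its log likelihood. It assumes the standard definitions AIC = -2*ll + 2*k and BIC = -2*ll + k*ln(N), with k counting the intercept. The scalar names are hypothetical.

```stata
*inputs copied from the full model output
scalar ll   = -65.494139
*k = number of estimated parameters (8 slopes + 1 intercept)
scalar k    = 9
scalar nobs = 112

*AIC charges 2 per parameter; BIC charges ln(N) per parameter
scalar aic = -2*ll + 2*k
scalar bic = -2*ll + k*ln(nobs)

di "AIC = " aic
di "BIC = " bic
di "BIC penalty per parameter = " ln(nobs)
```

The results should match the Full column of the table. With N = 112, the BIC penalty per parameter is about 4.72 against 2 for AIC, which is why BIC leans toward the smaller activity model. After fitting a model, ```estat ic``` reports the same statistics.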

Next we can compare nested models.

### lrtest
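The ```lrtest``` command is a postestimation command that tests nested models. Estimate each model and save it with ```estimates store```, then pass the stored names to ```lrtest```, listing the larger (full) model first and the smaller (restricted) model second.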

```lrtest full restricted```

In [16]:
quietly logit low i.race age ptd ht ui lwt i.smoke
estimate store full

quietly logit low i.race age
estimate store demo

lrtest full demo







Likelihood-ratio test                                 LR chi2(5)  =     24.21
(Assumption: demo nested in full)                     Prob > chi2 =    0.0002


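The likelihood-ratio test assesses whether the full model improves model fit relative to the smaller model. Here the test is significant (LR chi2(5) = 24.21, p = .0002), so the full model fits significantly better than the demographic model.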
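The test statistic can be reproduced by hand from the two log likelihoods reported earlier. This is a sketch under the usual assumption that the statistic is twice the difference in log likelihoods, with degrees of freedom equal to the number of extra parameters. The scalar names are hypothetical, and ```chi2tail()``` is Stata's upper-tail chi-square function.

```stata
*log likelihoods copied from the full and demographic model output
scalar ll_full = -65.494139
scalar ll_demo = -77.597923

*LR statistic: twice the gain in log likelihood
scalar lr = 2*(ll_full - ll_demo)

*df = extra parameters in the full model (8 slopes - 3 slopes)
scalar df = 5

*upper-tail p-value from the chi-square distribution
scalar p = chi2tail(df, lr)

di "LR chi2(" df ") = " lr
di "Prob > chi2 = " p
```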

## Practice: Logistic regression
* Q1: Open the nhanes2 dataset: https://www.stata-press.com/data/r17/nhanes2.dta
* Q2: Use the ```describe``` command and observe the variables. Use ```codebook variable``` to see the breakdown of a variable. Use ```summ num_variable``` and ```tab cat_variable``` to see the descriptive statistics of different variables.
* Q3: Use either highbp or diabetes as the outcome variable. Plan four models: three nested sub-models and one full/saturated model.
* Q4: Estimate the models with Stata. Compare the models.

In [17]:
use "https://www.stata-press.com/data/r17/nhanes2.dta", clear
describe




Contains data from https://www.stata-press.com/data/r17/nhanes2.dta
  obs:        10,351                          
 vars:            58                          20 Dec 2020 10:07
 size:     1,107,557                          
--------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
sampl           long    %9.0g                 Unique case identifier
strata          byte    %9.0g                 Stratum identifier
psu             byte    %9.0g      psulbl     Primary sampling unit
region          byte    %9.0g      region     Region
smsa            byte    %22.0g     smsalbl    SMSA type
location        byte    %9.0g                 Location (stand office ID)
houssiz         byte    %9.0g                 Number of people in household
sex             byte    %9.0g      sex        S

In [5]:
tab highbp


 High blood |
   pressure |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      5,975       57.72       57.72
          1 |      4,376       42.28      100.00
------------+-----------------------------------
      Total |     10,351      100.00


In [18]:
logit diabetes i.region i.rural, vce(robust)
estimate store dia_geo

logit highbp i.region i.rural, vce(robust)
estimate store hbp_geo



Iteration 0:   log pseudolikelihood = -1999.7591  
Iteration 1:   log pseudolikelihood = -1996.8303  
Iteration 2:   log pseudolikelihood = -1996.8158  
Iteration 3:   log pseudolikelihood = -1996.8158  

Logistic regression                             Number of obs     =     10,349
                                                Wald chi2(4)      =       6.01
                                                Prob > chi2       =     0.1981
Log pseudolikelihood = -1996.8158               Pseudo R2         =     0.0015

------------------------------------------------------------------------------
             |               Robust
    diabetes |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      region |
         MW  |  -.0380387   .1378366    -0.28   0.783    -.3081935    .2321161
          S  |    .200742   .1333676     1.51   0.132    -.0606536    .4621376
          W  |  -.0686786   .1

In [28]:
lrtest dia_full dia_geo

LR test likely invalid for models with robust vce


r(498);
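
The error arises because models fit with ```vce(robust)``` report a log pseudolikelihood, and ```lrtest``` will not compare those. One possible workaround, sketched here as an assumption rather than as this workbook's own approach, is a Wald test of the added variables with ```testparm``` after the full model. It uses only variables that appear in the models in this practice section.

```stata
use "https://www.stata-press.com/data/r17/nhanes2.dta", clear

*refit the full diabetes model with robust standard errors
quietly logit diabetes i.sex i.race houssiz weight tcresult heartatk i.region i.rural, vce(robust)

*joint Wald test that every non-geographic coefficient is zero
*(compares the full model with the geographic model)
testparm i.sex i.race houssiz weight tcresult heartatk
```

A significant result suggests that the demographic and health variables improve on the geographic model. Another option is to refit both models without ```vce(robust)``` and then use ```lrtest```.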





In [19]:
logit diabetes i.sex i.race houssiz, vce(robust)
estimate store dia_demo

logit highbp i.sex i.race  houssiz, vce(robust)
estimate store hbp_demo



Iteration 0:   log pseudolikelihood = -1999.7591  
Iteration 1:   log pseudolikelihood = -1969.5459  
Iteration 2:   log pseudolikelihood = -1967.3389  
Iteration 3:   log pseudolikelihood =  -1967.335  
Iteration 4:   log pseudolikelihood =  -1967.335  

Logistic regression                             Number of obs     =     10,349
                                                Wald chi2(4)      =      63.95
                                                Prob > chi2       =     0.0000
Log pseudolikelihood =  -1967.335               Pseudo R2         =     0.0162

------------------------------------------------------------------------------
             |               Robust
    diabetes |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sex |
     Female  |   .1454004   .0930728     1.56   0.118    -.0370189    .3278198
             |
        race |
      Black  |   .6778764   

In [23]:
logit diabetes weight tcresult heartatk, vce(robust)
estimate store dia_health

logit highbp weight tcresult heartatk, vce(robust)
estimate store hbp_health



Iteration 0:   log pseudolikelihood = -1999.7591  
Iteration 1:   log pseudolikelihood = -1965.2895  
Iteration 2:   log pseudolikelihood = -1943.6421  
Iteration 3:   log pseudolikelihood = -1943.4857  
Iteration 4:   log pseudolikelihood = -1943.4856  

Logistic regression                             Number of obs     =     10,349
                                                Wald chi2(3)      =     126.33
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -1943.4856               Pseudo R2         =     0.0281

------------------------------------------------------------------------------
             |               Robust
    diabetes |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |    .018354   .0028067     6.54   0.000     .0128531     .023855
    tcresult |   .0032118   .0008891     3.61   0.000     .0014691    .00

In [24]:
logit diabetes i.sex i.race houssiz weight tcresult heartatk i.region i.rural, vce(robust)
estimate store dia_full

logit highbp i.sex i.race houssiz weight tcresult heartatk i.region i.rural, vce(robust)
estimate store hbp_full



Iteration 0:   log pseudolikelihood = -1999.7591  
Iteration 1:   log pseudolikelihood = -1935.1592  
Iteration 2:   log pseudolikelihood = -1904.1958  
Iteration 3:   log pseudolikelihood = -1904.0324  
Iteration 4:   log pseudolikelihood = -1904.0323  

Logistic regression                             Number of obs     =     10,349
                                                Wald chi2(11)     =     200.77
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -1904.0323               Pseudo R2         =     0.0479

------------------------------------------------------------------------------
             |               Robust
    diabetes |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sex |
     Female  |    .427921   .1011299     4.23   0.000     .2297101     .626132
             |
        race |
      Black  |   .5884612   

In [27]:
est table dia_demo dia_health dia_geo dia_full, b(%7.2f) p(%4.3f) stat(bic aic)
est table hbp_demo hbp_health hbp_geo hbp_full, b(%7.2f) p(%4.3f) stat(bic aic)



------------------------------------------------------
    Variable | dia_d~o   dia_h~h   dia_geo   dia_f~l  
-------------+----------------------------------------
         sex |
     Female  |    0.15                          0.43  
             |   0.118                         0.000  
             |
        race |
      Black  |    0.68                          0.59  
             |   0.000                         0.000  
      Other  |    0.20                          0.57  
             |   0.555                         0.112  
             |
     houssiz |   -0.19                         -0.19  
             |   0.000                         0.000  
      weight |              0.02                0.02  
             |             0.000               0.000  
    tcresult |              0.00                0.00  
             |             0.000               0.012  
    heartatk |              1.12                1.14  
             |             0.000               0.000  
   