# MSDS 7333 - Case Study One: Using Multiple Imputation

### Investigators
- [Matt Baldree](mailto:mbaldree@smu.edu?subject=lab2)
- [Ben Brock](mailto:bbrock@smu.edu?subject=lab2)
- [Tom Elkins](mailto:telkins@smu.edu?subject=lab2)
- [Austin Kelly](mailto:ajkelly@smu.edu?subject=lab2)


<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:5px;'>
    <h3>Instructions</h3>
    <p>Carry out the following steps for Case Study One: Using Multiple Imputation.</p>
     <p>Steps:</p>
    <ol>
        <li>Use PROC MI to discover the missing values patterns and to decide what MI options to use. (Assume no need for transformations.)</li>
        <li>Use PROC MI to create multiple imputed data sets.</li>
        <li>Use PROC REG to analyze the multiple data sets while outputting information to be used in MIANALYZE.</li>
        <li>Use PROC MIANALYZE to summarize the imputed analyses.</li>
        <li>Compare these results to the listwise deletion results.</li>
    </ol> 
    <p>Report Sections:</p>
    <ol>
        <li><a href="#introduction">Introduction</a> <b>(5 points)</b></li>
        <li><a href="#background">Background</a> <b>(10 points)</b></li>
        <li><a href="#methods">Methods</a> <b>(30 points)</b></li>
        <li><a href="#results">Results</a> <b>(30 points)</b></li>
        <li><a href="#conclusion">Conclusion</a> <b>(5 points)</b></li>
        <li><a href="#biblio">Bibliography and Citation</a> <b>(5 points)</b></li>
        <li>[Code](#code) <b>(5 points)</b></li>
    </ol>
     <p>Other Grading Criteria:</p>
    <ol>
        <li>Grammar and Organization <b>(10 points)</b></li>
    </ol>
</div>

<a id='introduction'></a>
## 1 - Introduction
<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Introduction (<b>5 points total</b>)</h3>
</div>

### Predicting Miles per Gallon from Engine Type, Cylinders, Size, Weight, Horsepower, and Acceleration.

Miles per gallon is an important attribute for anyone, first-time buyer or seasoned owner, who plans to purchase a new vehicle.  The “CARMPG” data set contains data for 38 cars as measured in 2005.  In today’s global economy, buyers can choose among domestic and foreign models that are fuel-efficient, environmentally friendly, and easy to maintain.


In this exercise, we will predict the miles per gallon (mpg) of the vehicle based on the following attributes: Engine Type, Cylinders, Size, Horsepower, Weight and Acceleration.


Objectives of this case study:
- Use PROC MI to discover the missing value patterns and to decide what 
	MI options to use. (Assume no transformations.)
- Use PROC MI to create multiple imputed data sets.
- Use PROC REG to analyze the multiple data sets while outputting information
	to be used in MIANALYZE.
- Use PROC MIANALYZE to summarize the imputed analyses.
- Compare these results to the listwise deletion results.


#### NOTE: Based on the objective of this case study, our team decided to use the SAS programming language to solve the problem.


<a id="background"></a>
## 2 - Background

<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Background (<b>10 points total</b>)</h3>
</div>

## Background: Brief Background of Data Set and Multiple Imputation

### Brief Background of the "CARMPG" Data Set
The response variable is MPG. There are two categorical variables, Engine Type (ENG_TYPE) and Cylinders, and four continuous variables: Size, Horsepower (HP), Weight, and Acceleration (ACCEL).  The data set is a sample of 38 cars, each described by these seven attributes.  The objective of this exercise is to examine the relationship between each predictor and MPG so that a customer can identify which attributes matter most for fuel economy when purchasing a vehicle. The ultimate goal is to fit a model that predicts MPG from some or all of these variables.

#### Field Definitions
Variables used in the CARMPG data set. 

|Column|Data Type|Value Range|Description|Categorical/Continuous Variable|
|:-----|:--------|:----------|:----------|:-----:|
|ENG_TYPE|Integer|0-1|The type of engine in the vehicle.|Categorical|
|CYLINDERS|Integer|4-8|The number of cylinders in the vehicle's engine.|Categorical|
|SIZE|Integer|≥85|Engine displacement (larger number = bigger engine).|Continuous|
|HP|Integer|≥65|Engine horsepower.|Continuous|
|WEIGHT|Numeric|>0|Car weight in thousands of pounds.|Continuous|
|ACCEL|Numeric|>0|Acceleration: how quickly the car reaches 60 mph from a standstill (0 mph).|Continuous|
|MPG|Numeric|>0|Miles per gallon (mpg).|Response variable|


### Brief Background of Multiple Imputation (MI)
A challenge in analyzing any data set is deciding how to handle missing values.  As covered in class, there are many ways to address missing data, and the question is which one is best for a given problem [1].  The common types of missing data are: (1) MCAR (Missing Completely at Random), (2) MAR (Missing at Random), and (3) MNAR (Missing Not at Random).  

The MSDS 7333 "Quantifying the World" lecturer stated that there are two approaches that may be used when data are missing from the data set: (1) Fill in the blanks based on other variables which are highly correlated with the missing values, or (2) use the incomplete data set, omitting records that contain the missing values [2].   

For this case study, we have been challenged to use the multiple imputation procedures discussed in our class lectures to resolve the missing values in the data set [3].  The "Multiple Imputation" highlights are:

- Missing values are replaced by values calculated via existing values from other variables,
- Newly generated values stand in for missing values,
- Imputed values are drawn from distributions that reflect specific uncertainties about the existing values,
- Multiple data sets are created, each containing a different set of imputes,
- Standard techniques are then used to analyze each data set, and
- Results are combined into the overall analysis [3].


We will use the SAS PROC MI procedure to impute the missing values in the data set [3].   The multiple imputation workflow in SAS has three steps [3]:
1. Create the imputed data sets (PROC MI),
2. Analyze each imputed data set (PROC REG), and
3. Combine the analysis results (PROC MIANALYZE).


As stated earlier, our ultimate goal is to fit a model that predicts the response variable "mpg".  Again, the objectives of this case study are:
- Analyze this data set using multiple imputation.
- Use PROC MI to discover the missing value patterns and to decide what MI options to use. (Assume no transformations.)
- Use PROC MI to create multiple imputed data sets.
- Use PROC REG to analyze the multiple data sets while outputting information to be used in MIANALYZE.
- Use PROC MIANALYZE to summarize the imputed analyses.
- Compare these results to the listwise deletion results.

<a id="methods"></a>
## 3 - Methods

<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Methods (<b>30 points total</b>)</h3>
<ol>
    <li>Brief descriptive analysis of the data</li>
    <li>Discover correct missing values patterns.</li>
    <li>Explain missing data mechanism assumption and why it's reasonable</li>
    <li>Describe creation of multiple imputed data sets</li>
    <li>Descriptive analysis of multiple imputed data sets & comparison to listwise/complete data sets</li>
</ol>
</div>

<div style='margin-left:0%;margin-right:72%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<p>1. Brief Descriptive analysis of the data.</p>
</div>

By running PROC MEANS on the initial data set, we obtain the following table: 

|Variable|N|Mean|Std Dev|Minimum|Maximum|
|:----|:----|:----|:----|:----|:----|
|MPG|38|24.7605263|6.5473138|15.5|37.3|
|Cylinders|34|5.3235294|1.6090763|4|8|
|Size|35|180.8857143|91.4239547|85|360|
|HP|33|101.3333|27.1185668|65|155|
|Weight|32|2.9056250|0.7091906|1.915|4.36|
|Accel|34|14.9441176|1.5897763|11.3|19.2|
|Eng_Type|35|0.2857143|0.45483492|0|1|


At first glance, this is a standard descriptive statistics table. On closer inspection, the value of 'N' differs across variables: MPG has all 38 values, while Weight and HP have only 32 and 33, respectively. This is a data set with missing data! 
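
The missing counts can also be requested directly rather than inferred from 'N'. The following is a minimal sketch, assuming the data have been loaded into a data set named `cars` with the variable names used in the [Code](#code) section; `NMISS` adds a column with the number of missing values per variable.

```sas
* N = count of non-missing values, NMISS = count of missing values;
proc means data=cars n nmiss mean std min max;
	var mpg cylinders size hp weight accel engtype;
run;
```

For each variable, N and NMISS should sum to the 38 records in the data set.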

## "CARMPG" Data Set (Partial)


<img src='imgs/car-mpg-snippet-dataset.jpg'>


As you can see, quite a few values are missing throughout the CARMPG data set.

We have now verified, both from the PROC MEANS output and by visual inspection of the data set, that the data contain missing values.

<div style='margin-left:0%;margin-right:70%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<p>2. Discover correct missing values patterns.</p>
</div>

## Investigation into How Many Values are Missing in the Data Set


We can use SAS PROC REG to determine how many observations from the "CARMPG" data set will be deleted.   This is a quick way to get this information.

<img src='imgs/SummaryOfMissingObservations.jpg'>


Based on the PROC REG results, 20 of the 38 observations were deleted because of missing values, and only 18 were kept.  Losing more than half of the sample should cause us some concern.
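
The same count can be obtained without fitting a model. The sketch below is one way to do it, again assuming the `cars` data set from the [Code](#code) section; the counter names `complete` and `incomplete` are ours. It uses the `NMISS` function to flag any record with at least one missing analysis variable, which is exactly the set of records listwise deletion drops.

```sas
* count records with and without a missing analysis variable;
data _null_;
	set cars end=last;
	if nmiss(mpg, cylinders, size, hp, weight, accel, engtype) = 0
		then complete + 1;
	else incomplete + 1;
	* write both counters to the SAS log after the last record;
	if last then put complete= incomplete=;
run;
```

The two counters written to the log should agree with the numbers of observations used and deleted that PROC REG reports.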





Note:
Since ENG_TYPE and Cylinders are both categorical variables, the more formal way of fitting the linear regression would be to code them as dummy variables.   In the initial "CARMPG" data set, both of these variables have missing values.   We could have used either PROC GLM or PROC MIXED to fit the initial linear regression model; both procedures create the dummy variables internally through the CLASS statement.  With PROC REG, the dummy variables must be set up manually.   Our team did not attempt this for the exercise; we treated all variables as continuous in order to focus on the imputation.
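
For reference, the CLASS-based alternative mentioned in the note might look like the sketch below. It was not run as part of this study; it assumes the `cars` data set from the [Code](#code) section and still relies on listwise deletion for records with missing values.

```sas
* CLASS builds the dummy variables for the categorical predictors;
title 'Predicting MPG with CLASS variables (listwise deletion)';
proc glm data=cars;
	class cylinders engtype;
	model mpg = cylinders size hp weight accel engtype / solution;
run;
quit;
```

The `SOLUTION` option prints the parameter estimates, including one coefficient per non-reference level of each CLASS variable.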

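To find the frequency and pattern of the missing data, we used the SAS procedure PROC MI, which reports any patterns the missing data exhibit. 

Some things to consider: 
For the missing data pattern to be **Monotone**, the variables can be ordered so that once a variable is missing for a record, every variable to its right is missing as well; in the pattern table, the missing values are grouped together on the right side. This pattern is relatively rare and typically arises when data are left out methodically, for example when subjects drop out of a longitudinal study. Note that the pattern describes *where* values are missing, not *why*: a monotone pattern does not by itself establish that the data are *Missing Not at Random*.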

<img src="imgs/Monotone Example.jpg">

For the missing data pattern to be considered **Non-Monotone** (or **arbitrary**), we should expect to see the missing values scattered throughout the data set, with no readily discernible ordering. This pattern is far more common. Again, an arbitrary pattern does not by itself prove that the data are *Missing Completely at Random*; it mainly determines which imputation methods are applicable.

<img src="imgs/Arbitrary Example.jpg">

The code below displays the patterns of missing data, so we can determine whether the pattern is monotone or non-monotone (arbitrary). Its output follows the code.

<img src='imgs/Monotone-Code.jpg'>


<img src="imgs/Initial Missing Pattern.jpg">

With the aforementioned examples in mind, there is no immediately recognizable pattern to the missingness of this data. Therefore, we conclude that the missing data pattern is **Arbitrary** (non-monotone).

<div style='margin-left:0%;margin-right:50%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<p>3. Explain missing data mechanism assumption and why it's reasonable </p>
</div>

A common method for handling missing data is listwise deletion: if a record (row) has any missing value, the entire record is deleted. This can work on large data sets with tens of thousands of rows or more (if there are only a handful of missing values), but the repercussions are more pronounced on small data sets such as this one, where it discards 20 of 38 records. Deleting records reduces statistical power and, unless the data are missing completely at random, can bias the estimates. In short, every effort should be made to preserve the information in the original data set.

An alternative is regression substitution, which uses the observed data to predict plausible values for the missing entries. Filling in a single predicted value restores the sample size, but it treats the filled-in values as if they had been observed and so understates the uncertainty in the data.

Multiple imputation extends this idea and seems the most reasonable route for this data set. PROC MI's default MCMC method assumes that the data are missing at random (MAR), meaning the probability that a value is missing depends only on observed values, and that the variables are jointly multivariate normal. We take MAR as a reasonable working assumption because the missing values are scattered across variables and records with no evident structure, although the assumption cannot be verified from the observed data alone. PROC MI uses the Expectation-Maximization (EM) algorithm to obtain starting values for the MCMC chain. Drawing several imputations and pooling the analyses yields estimates that use all 38 records, with standard errors that reflect the extra uncertainty due to imputation; we expect these standard errors to be smaller than those obtained under listwise deletion.

<div style='margin-left:0%;margin-right:65%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<p>4. Describe creation of multiple imputed data sets</p>
</div>

## Create MI data using default MCMC for Non-Monotone

For the "CARMPG" data set, the key items in the PROC MI Model Information table are: (1) the method is MCMC, (2) the Multiple Imputation Chain is Single Chain, and (3) the Number of Imputations is 5.

<img src='imgs/Step1-MI.jpg'>

Using the SAS procedure PROC MI, the data set was imputed five times, with each imputation a separate draw of the missing values. Here is a snippet of the resulting imputations: 

<img src='imgs/miout snip.jpg'>

As we can see, the `_Imputation_` column on the far left of the PROC MI output identifies which imputation each row belongs to. The snippet shows the boundary between the first and second imputations. 

To reiterate, this process lets us keep every record in the data set. The five imputed values for a missing entry are not averaged into a single filled-in data set. Instead, PROC REG is run separately on each of the five imputed data sets (`BY _Imputation_`), saving the parameter estimates and their covariance matrices, and PROC MIANALYZE then combines those five sets of results into one set of estimates.
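
A quick structural check of the PROC MI output is sketched below, assuming the output data set is named `miout` as in the [Code](#code) section. It tabulates the `_Imputation_` index so we can confirm that each imputation contributes one full copy of the data.

```sas
* each value of _Imputation_ should account for one full copy of the data;
proc freq data=miout;
	tables _Imputation_;
run;
```

With five imputations of a 38-record data set, we would expect five levels with the same frequency.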

<div style='margin-left:0%;margin-right:35%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<p>5. Descriptive analysis of multiple imputed data sets & comparison to listwise/complete data sets</p>
</div>

Once the PROC MI step has been completed, PROC REG is executed on the output. This step fits a regression model to each imputed data set, and each model is fitted independently of the others. From these five models, PROC MIANALYZE produces a single pooled model: each pooled coefficient is the average of the five estimates, and its standard error combines the variability within each imputation with the variability between imputations. Because this pooled model uses all 38 records, we expect it to suffer less from lost information than the listwise-deletion model.
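
The pooling follows Rubin's rules. As a sketch in our own notation (not taken from the SAS output), let $\hat{Q}_i$ be the estimate of one coefficient and $U_i$ its squared standard error in imputation $i = 1, \dots, m$, with $m = 5$ here:

$$
\bar{Q} = \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i, \qquad
\bar{U} = \frac{1}{m}\sum_{i=1}^{m}U_i, \qquad
B = \frac{1}{m-1}\sum_{i=1}^{m}\left(\hat{Q}_i-\bar{Q}\right)^2, \qquad
T = \bar{U} + \left(1+\frac{1}{m}\right)B
$$

Here $\bar{Q}$ is the pooled estimate, $\bar{U}$ the within-imputation variance, $B$ the between-imputation variance, and $\sqrt{T}$ the pooled standard error.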

## Run Analysis Using Imputed Data


<img src='imgs/RunMI-ImputedData.jpg'>


No observations were deleted in this analysis; each imputed data set contains the complete set of 38 observations.
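
A descriptive comparison of the complete-case data with the imputed data sets can be sketched as follows. This assumes the `cars` and `miout` data sets from the [Code](#code) section; the first step restricts `cars` to the records that listwise deletion would keep, and the second summarizes each imputed data set separately.

```sas
* complete cases only, i.e. the records kept by listwise deletion;
title 'Complete cases';
proc means data=cars n mean std min max;
	where nmiss(mpg, cylinders, size, hp, weight, accel, engtype) = 0;
	var mpg cylinders size hp weight accel engtype;
run;

* the same statistics within each of the imputed data sets;
title 'Imputed data, by imputation';
proc means data=miout n mean std min max;
	class _Imputation_;
	var mpg cylinders size hp weight accel engtype;
run;
```

Large shifts in a mean or standard deviation between the two outputs would flag variables whose imputed values deserve a closer look.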

<a id="results"></a>
## 4 - Results

<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Results (<b>30 points total</b>)</h3>
</div>

The results for our case study of using multiple imputation are shown below beginning with the results of `PROC REG` with no imputation and with multiple imputation from `PROC MI` and `PROC MIANALYZE`. A comparison of the results is also provided.

### 4.1.1 No Imputation

The dataset was first analyzed with `PROC REG` for MPG to establish a baseline and to determine whether any data were missing. Data were found to be missing in an arbitrary pattern. The ANOVA results for the dataset with no imputation are shown in [Figure 4.1](#ANOVA with No Imputation). The total degrees of freedom is 17. 

The fit diagnostics for MPG are shown in [Figure 4.2](#Fit with No Imputation). These diagrams show a fairly random residual cloud, right skewness, and few outliers. The independent variables predict MPG well, with an adjusted $R^2=.88$; that is, about 88% of the variance in MPG is explained by the other variables. No data points exert strong leverage.

The residual diagrams in [Figure 4.3](#Residual with No Imputation) show two categorical variables, engine type and cylinders, by which the dataset could be further analyzed. The data show the following:
- there are more four-cylinder engines than any other kind, 
- the average acceleration value is about 15 (consistent with seconds from 0 to 60 mph rather than $ft/s^2$),
- the engine displacement size is usually less than 200 $in^3$,
- the engine horsepower is evenly distributed, and
- cars typically weigh less than 3,000 pounds.

The parameters for the linear regression are shown in [Figure 4.4](#Parameters Initial). Only the HP parameter is significant at alpha = .05. The signs of the estimates indicate that fewer cylinders and less horsepower are associated with a higher MPG. For instance, each additional unit of horsepower is associated with a decrease of about 0.195 MPG, holding the other variables fixed.
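
Written out with the "Orig Estimate" values reported in Table 4.9, the listwise-deletion fit is approximately the following (the variable labels are ours):

$$
\widehat{\text{MPG}} = 70.148 - 3.334\,\text{cylinders} + 0.0228\,\text{size} - 0.195\,\text{hp} - 0.306\,\text{weight} - 0.782\,\text{accel} + 6.599\,\text{engtype}
$$

Under this fit, a 10 HP increase with everything else held fixed corresponds to a predicted change of $10 \times (-0.195) = -1.95$ MPG.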

![ANOVA with No Imputation](imgs/anova_initial.png "ANOVA with No Imputation")
<p style='text-align: center;'>
Figure 4.1: ANOVA Table with No Imputation
</p>

![Fit with No Imputation](imgs/fit_initial.png "Fit with No Imputation")
<p style='text-align: center;'>
Figure 4.2: Fit with No Imputation
</p>

![Residual with No Imputation](imgs/residuals_initial.png "Residual with No Imputation")
<p style='text-align: center;'>
Figure 4.3: Residual with No Imputation
</p>

![Parameters Initial](imgs/parameters_initial.png "Parameters with No Imputation")
<p style='text-align: center;'>
Figure 4.4: Parameters with No Imputation
</p>

### 4.1.2 With Imputation

Next, five imputed datasets were created with `PROC MI` using the default `MCMC` method, which handles arbitrary missing patterns. `PROC REG` was then run on each imputed dataset. Finally, `PROC MIANALYZE` combined the five `PROC REG` results into pooled parameter estimates.

[Figure 4.5](#ANOVA Imputation Iteration Five) shows the results of imputation iteration five. Notice that the total degrees of freedom rose from 17 (18 complete records minus 1) to 37 (38 records minus 1). The larger sample provides more power. 

The fit diagnostics for MPG are shown in [Figure 4.6](#Fit with Imputation). These diagrams again show a fairly random residual cloud, right skewness, and few outliers, but this time the residual cloud looks more random. The independent variables predict MPG well, with an adjusted $R^2=.91$ compared to .88 (original); that is, about 91% of the variance in MPG is explained by the other variables. No data points exert strong leverage.

The residual diagrams in [Figure 4.7](#Residual with Imputation) again show the two categorical variables, engine type and cylinders, but the imputation added values that fall between the original categories. This isn't realistic because engine type and cylinders must come from a defined set of values. The other data insights are similar to the no-imputation analysis above.
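
One way to see exactly which values were imputed for the two categorical variables is to tabulate them in the imputed output. This is a sketch that assumes the `miout` data set from the [Code](#code) section.

```sas
* list the distinct values of the categorical variables after imputation;
* values outside the original categories were created by the continuous MCMC model;
proc freq data=miout;
	tables cylinders engtype;
run;
```

Any level that is not a valid cylinder count or engine type code comes from an imputed record.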

The parameters for the linear regression are shown in [Figure 4.8](#Parameters Combined). The Cylinders and Engine Type parameters became significant at alpha = .05, in addition to HP. This might be due to the smaller standard errors that come with the larger sample, or to the non-standard cylinder and engine type values that were imputed. Further analysis should be performed to determine why they became significant. The parameters still show that fewer cylinders and less horsepower go with a higher MPG, but the magnitude of the HP coefficient dropped from 0.195 to 0.159.

![ANOVA Imputation Iteration Five](imgs/anova_im5.png "ANOVA Table with Imputation Iteration Five")
<p style='text-align: center;'>
Figure 4.5: ANOVA Table with Imputation Iteration Five
</p>

![Fit with Imputation](imgs/fit_combined.png "Fit with Imputation")
<p style='text-align: center;'>
Figure 4.6: Fit with Multiple Imputation
</p>

![Residual with Imputation](imgs/residuals_combined.png "Residual with Imputation")
<p style='text-align: center;'>
Figure 4.7: Residual with Multiple Imputation
</p>

![Parameters Combined](imgs/parameters_combined.png "Parameters with Imputation")
<p style='text-align: center;'>
Figure 4.8: Parameters with Multiple Imputation
</p>

### 4.1.3 Comparison of No Imputation and Multiple Imputation

[Table 4.9](#parameters) compares the parameter estimates from `PROC REG` with no imputation of missing data against the pooled estimates from `PROC MIANALYZE` over five imputations. The standard error fell substantially for every parameter, by roughly 27% to 56%. Smaller standard errors mean the pooled coefficients are estimated more precisely, which is what we would expect from using all 38 observations rather than 18. This leads us to believe that the pooled parameter estimates are more reliable, although it does not by itself show that they are closer to the true population values.

|Variable    |Orig Estimate|Orig Std Error|Combined Estimate|Combined Std Error|Diff Estimate|Diff Std Error|Diff % Std Error|
|:-----------|:-----------:|:------------:|:---------------:|:----------------:|:-----------:|:------------:|:---:|
|intercept   | 70.148      |8.038         |69.543           |4.676             |-0.605       |-3.362        |-41.8|
|cylinders   |-3.334       |1.561         |-2.892           |0.767             | 0.442       |-0.794        |-50.9|
|size        | 0.0228      |0.032         | 0.031           |0.0217            | 0.008       |-0.010        |-32.2|
|horsepower  |-0.195       |0.081         |-0.159           |0.0461            | 0.036       |-0.0349       |-43.1|
|weight      |-0.306       |5.133         |-3.215           |3.740             |-2.909       |-1.393        |-27.1|
|acceleration|-0.782       |0.583         |-0.722           |0.410             | 0.060       |-0.173        |-29.7|
|engine type | 6.599       |3.590         | 5.855           |1.580             |-0.744       |-2.010        |-56.0|

<a id="parameters"></a>
<p style='text-align: center;'>
Table 4.9: Parameters Comparison of No Imputation and Multiple Imputation
</p>
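
The "Diff Std Error" and "Diff % Std Error" columns can be recomputed from the two standard error columns. The sketch below types the standard errors in by hand from Table 4.9; the data set name `se_compare` and its variable names are ours.

```sas
* standard errors copied from Table 4.9;
data se_compare;
	input variable :$12. orig_se mi_se;
	* change in standard error, absolute and as a percent of the original;
	diff_se = mi_se - orig_se;
	pct_change = 100 * diff_se / orig_se;
	format pct_change 6.1;
	datalines;
intercept 8.038 4.676
cylinders 1.561 0.767
size 0.032 0.0217
horsepower 0.081 0.0461
weight 5.133 3.740
acceleration 0.583 0.410
engtype 3.590 1.580
;
run;

proc print data=se_compare noobs;
run;
```

A negative `pct_change` means the pooled standard error is smaller than the listwise-deletion standard error.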

### 4.1.4 Comparison of Original Analysis with Imputation Analysis

<img src='imgs/ComparisonOriginal-Imputation.jpg'>


Here the total degrees of freedom for the imputed data set is 37, whereas for the original data set under listwise deletion it is 17, and the mean squared error for the imputed data set is lower than for the original.

<a id="conclusion"></a>
## 5 - Conclusion

<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Conclusion (<b>5 points total</b>)</h3>
</div>

This case study on multiple imputation with the cars MPG dataset illustrates how an investigator can analyze a dataset with missing data, determine its missingness pattern, and adopt a strategy that allows most if not all of the data to be used. Using more data gives the model more power and should let it represent the population more accurately. Multiple imputation provides an analysis that does the following:
- "reflects the uncertainty due to missing values,
- creates a representative random sample of missing values,
- is typically better than single imputation methods because it results in valid statistical inferences that reflect the uncertainty due to missing values." [4]

The combined parameter estimates listed in [Table 4.9](#parameters) give us confidence that we have a good set of estimates that reflects the natural variability within the original data set [4].   In our study, two variables were categorical (Engine Type and Cylinders) but were imputed as if they were continuous. It is not yet clear to us whether this approach has an adverse effect on the results; further study is required [5]. 
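
One candidate for that further study is PROC MI's fully conditional specification (FCS), which accepts a CLASS statement so that categorical variables are imputed as categories. The sketch below is hypothetical and was not run as part of this study; the output name `miout_fcs` is ours, and the seed and number of imputations are simply carried over from the MCMC run. With no method named, FCS is expected to use the discriminant function method for CLASS variables and regression for the continuous ones.

```sas
* hypothetical alternative, not run in this study;
* CLASS marks the categorical variables so they are imputed as categories;
proc mi data=cars out=miout_fcs seed=35399 nimpute=5;
	class cylinders engtype;
	fcs nbiter=20;
	var mpg cylinders size hp weight accel engtype;
run;
```

Whether the discriminant function method behaves well with only 38 records and sparsely populated cylinder levels is an open question that would need to be checked against the procedure's log and output.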

The raw data contained missing values. The missing items were coded as '.', which SAS reads as a missing numeric value.

The case study was performed in SAS. In R, the approach might look like the following [6]:
- read the data into a data frame
- use summary() to count the missing values per variable
- use na.pattern() from Hmisc (or md.pattern() from mice) to characterize missingness
- use glm() to fit a regression model for baseline parameters
- use mice() from mice to impute missing values (the equivalent of the SAS MCMC option still needs to be researched)
- use with() (or the older glm.mids()) to fit the model to each imputed data set, then pool() to combine the results
- use summary() to view the pooled parameter values

#### Future Work
Below are future work items for consideration:
- Research best way to replace missing data with appropriate SAS character such as '.' for continuous data. Obviously, we can use Pandas, Excel, etc., but can we transform the data via SAS?
- Rerun the analysis with R code and compare with SAS.
- Research best practice for imputing categorical variables.

<a id="biblio"></a>
## 6 - Bibliography and Citation

<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Bibliography and Citation (<b>5 points total</b>)</h3>
</div>

- [1], 1.3 The Problem: Missing Values in Data Sets, MSDS 7333 Quantifying the World

- [2], 1.4 Challenges to Missing Data, MSDS 7333 Quantifying the World

- [3], 1.7 Multiple Imputation, MSDS 7333 Quantifying the World

- [4], 2.3 PROC MI Example II, MSDS 7333 Quantifying the World

- [5], Kropko, Jonathan et al.; "Multiple Imputation for Continuous and Categorical Data: Comparing Joint and Conditional Approaches"; http://www.stat.columbia.edu/~gelman/research/published/MI_manuscript_RR.pdf, 2013.

- [6], Kleinman, Ken and Horton, Nicholas J.; SAS and R Data Management, Statistical Analysis, and Graphics; pg. 309-311; 2014.

<a id="code"></a>
## 7 - Code

<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Code (5 points)</h3>
</div>


```
/*
MSDS7333-401: Quantifying the World 

Case Study 2
- Use PROC MI to discover the missing values patterns and to decide what 
	MI options to use. (Assume no transformations.)
- Use PROC MI to create multiple imputed data sets.
- Use PROC REG to analyze the multiple data sets while outputting information
	to be used in MIANALYZE.
- Use PROC MIANALYZE to summarize the imputed analyses.
- Compare these results to the listwise deletion results.
*/

* load dataset;
data cars;
	infile "e:\qtw_02\carmpg_2_2_2_2.csv" firstobs=2 delimiter=',';
	input auto :$23. mpg cylinders size hp weight accel engtype;
run;

* print out entries;
title 'MPG Dataset';
proc print data=cars;
run;

* contents of dataset;
title 'MPG Dataset Contents';
proc datasets;
   contents data=_all_;
run;

* what data is missing from dataset?;
* use PROC REG with listwise deletion;
title 'Predicting MPG (initial)';
proc reg data=cars;
	model mpg = cylinders size hp weight accel engtype;
run;
quit;

* is the missing data monotone or non-monotone?;
* the data is non-monotone;
title 'MI Pattern';
ods select misspattern;
proc mi data=cars nimpute=0;
	var mpg cylinders size hp weight accel engtype;
run;

* create mi data using default MCMC for non-monotone;
title 'MI with MCMC';
proc mi data=cars out=miout seed=35399 nimpute=5;
	var mpg cylinders size hp weight accel engtype;
run;

* run reg on each imputed data set;
* outest= and covout save the estimates and covariances for MIANALYZE;
title 'Predicting MPG with MI (final)';
proc reg data=miout outest=outreg covout;
	model mpg = cylinders size hp weight accel engtype;
	by _Imputation_;
run;
quit;

* combine results;
title 'Predicting MPG (combined)';
proc mianalyze data=outreg;
	modeleffects Intercept cylinders size hp weight accel engtype;
run;
```
