# MSDS 7333 - Case Study One: Using Multiple Imputation

### Investigators
- [Matt Baldree](mailto:mbaldree@smu.edu?subject=lab2)
- [Ben Brock](mailto:bbrock@smu.edu?subject=lab2)
- [Tom Elkins](mailto:telkins@smu.edu?subject=lab2)
- [Austin Kelly](mailto:ajkelly@smu.edu?subject=lab2)


<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:5px;'>
    <h3>Instructions</h3>
    <p>Carry out the following steps for Case Study One: Using Multiple Imputation.</p>
     <p>Steps:</p>
    <ol>
        <li>Use PROC MI to discover the missing values patterns and to decide what MI options to use. (Assume no need for transformations.)</li>
        <li>Use PROC MI to create multiple imputed data sets.</li>
        <li>Use PROC REG to analyze the multiple data sets while outputting information to be used in MIANALYZE.</li>
        <li>Use PROC MIANALYZE to summarize the imputed analyses.</li>
        <li>Compare these results to the listwise deletion results.</li>
    </ol> 
    <p>Report Sections:</p>
    <ol>
        <li><a href="#introduction">Introduction</a> <b>(5 points)</b></li>
        <li><a href="#background">Background</a> <b>(10 points)</b></li>
        <li><a href="#methods">Methods</a> <b>(30 points)</b></li>
        <li><a href="#results">Results</a> <b>(30 points)</b></li>
        <li><a href="#conclusion">Conclusion</a> <b>(5 points)</b></li>
        <li><a href="#biblio">Bibliography and Citation</a> <b>(5 points)</b></li>
        <li>[Code](#code) <b>(5 points)</b></li>
    </ol>
     <p>Other Grading Criterion:</p>
    <ol>
        <li>Grammar and Organization <b>(10 points)</b></li>
    </ol>
</div>

<a id='introduction'></a>
## 1 - Introduction
<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Introduction (<b>5 points total</b>)</h3>
</div>

### Predicting Miles per Gallon from Engine Type, Cylinders, Size, Weight, Horsepower, and Acceleration

Miles per gallon (mpg) is an important consideration for both first-time and experienced car buyers.  The “CARMPG” data set contains data for 38 cars as measured in 2005.  In today’s global economy, buyers can choose among domestic and foreign car models that are fuel-efficient, environmentally friendly, and easy to maintain.


In this exercise, we will predict the miles per gallon (mpg) of the vehicle based on the following attributes: Engine Type, Cylinders, Size, Horsepower, Weight and Acceleration.


Objectives of this case study:
- Use PROC MI to discover the missing value patterns and to decide what MI options to use. (Assume no transformations.)
- Use PROC MI to create multiple imputed data sets.
- Use PROC REG to analyze the multiple data sets while outputting information to be used in MIANALYZE.
- Use PROC MIANALYZE to summarize the imputed analyses.
- Compare these results to the listwise deletion results.


#### NOTE: Based on the objective of this case study, our team decided to use the SAS programming language to solve the problem.


<a id="background"></a>
## 2 - Background

<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Background (<b>10 points total</b>)</h3>
</div>

## Background: Brief Background of Data Set and Multiple Imputation

### Brief Background of the "CARMPG" Data Set
The response variable is mpg.  As listed in the Field Definitions table below, Engine Type is the only categorical variable (coded 0 or 1); Cylinders is an integer count that we treat as numeric; and Size, Horsepower (HP), Weight, and Acceleration (ACCEL) are continuous.  The data set consists of a sample of 38 observations, each describing one car by these attributes (mpg, Engine Type, Cylinders, Size, Horsepower, Weight, and Acceleration).  The objective of this exercise is to examine the relationship between each predictor variable and the response variable mpg, so that a customer can identify the factors that matter most for fuel economy when purchasing a vehicle.  The ultimate goal is to fit a model that uses some or all of these variables to predict mpg.


### Brief Background of Multiple Imputation (MI)
A recurring challenge for any data analyst or data scientist is how to handle data sets that are missing some values.  Our class covered many ways to address missing data, which raises the question of which approach is best [1].  Missing data are commonly classified into three types: (1) MCAR (Missing Completely at Random), (2) MAR (Missing at Random), and (3) MNAR (Missing Not at Random).

The MSDS Quantifying the World lecturer described two approaches to use when values are missing from a data set: (1) fill in the blanks based on other variables that are highly correlated with the variable containing the missing values, or (2) use the incomplete data set, omitting records that contain missing values [2].
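
To make the cost of the second approach concrete, the following is a minimal sketch that counts missing values per variable and flags which records listwise deletion would drop. It assumes the `cars` data set and the variable names from the Code section; the names `cars_flagged`, `n_missing`, and `complete_case` are hypothetical and introduced only for illustration.

```sas
* Sketch only: assumes the CARS data set created in the Code section;
* Count non-missing (N) and missing (NMISS) values for each model variable;
title 'Missing Values per Variable';
proc means data=cars n nmiss;
    var mpg cylinders size hp weight;
run;

* Flag complete cases. CMISS returns the number of missing arguments;
* CARS_FLAGGED, N_MISSING, and COMPLETE_CASE are hypothetical names;
data cars_flagged;
    set cars;
    n_missing = cmiss(mpg, cylinders, size, hp, weight);
    complete_case = (n_missing = 0);
run;

* Rows with COMPLETE_CASE = 0 would be dropped by listwise deletion;
title 'Complete Cases (1) versus Incomplete Cases (0)';
proc freq data=cars_flagged;
    tables complete_case;
run;
title;
```

With only 38 observations, each dropped record is a noticeable share of the sample, which is one reason to consider imputation instead.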

For this case study, we have been challenged to use the multiple imputation procedures discussed in our class lectures to resolve the missing values in the data set [3].  The "Multiple Imputation" highlights are:

- Missing values are replaced by values calculated from the existing values of other variables,
- Newly generated values stand in for missing values,
- Imputed values are drawn from distributions that reflect specific uncertainties about the existing values,
- Multiple data sets are created, each containing a different set of imputed values,
- Standard techniques are then used to analyze each data set, and
- Results are combined into the overall analysis [3].
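
The final combining step is commonly done with Rubin's rules, which is our understanding of what PROC MIANALYZE reports; the formulas below are a standard statement of those rules rather than something taken from the lecture notes. The symbols are our own notation: $m$ is the number of imputed data sets, $\hat{Q}_i$ is a coefficient estimate from data set $i$, and $\hat{U}_i$ is its squared standard error.

```latex
\[
\begin{aligned}
\bar{Q} &= \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i
  && \text{(combined estimate)} \\
\bar{U} &= \frac{1}{m}\sum_{i=1}^{m}\hat{U}_i
  && \text{(within-imputation variance)} \\
B &= \frac{1}{m-1}\sum_{i=1}^{m}\bigl(\hat{Q}_i-\bar{Q}\bigr)^{2}
  && \text{(between-imputation variance)} \\
T &= \bar{U} + \Bigl(1+\frac{1}{m}\Bigr)B
  && \text{(total variance)}
\end{aligned}
\]
```

The combined standard error is $\sqrt{T}$. The between-imputation term $B$ is what carries the extra uncertainty due to the missing values into the final result.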


We will use the SAS PROC MI procedure to examine and impute the missing values in the data set [3].  A multiple imputation analysis in SAS has three phases [3]:
- (1) Create the imputed data sets (PROC MI),
- (2) Analyze each imputed data set (PROC REG), and
- (3) Combine the analysis results (PROC MIANALYZE).


As stated earlier, our ultimate goal is to fit a model and predict the value of the response variable "mpg".  Again, the objectives of this case study are:
- Analyze this data set using multiple imputation.
- Use PROC MI to discover the missing value patterns and to decide what MI options to use. (Assume no transformations.)
- Use PROC MI to create multiple imputed data sets.
- Use PROC REG to analyze the multiple data sets while outputting information to be used in MIANALYZE.
- Use PROC MIANALYZE to summarize the imputed analyses.
- Compare these results to the listwise deletion results.



#### Field Definitions
Variables used in the CARMPG data set. 

|Column|Data Type|Value Range|Description|Variable Type|
|:-----|:--------|:----------|:----------|:-----:|
|ENG_TYPE|Integer|0-1|The type of engine in the vehicle.|Categorical|
|CYLINDERS|Integer|4-8|The number of cylinders in the engine.|Continuous|
|SIZE|Integer|>89|Engine displacement (larger number = bigger engine).|Continuous|
|HP|Integer|>65|Engine horsepower, a measure of the engine's power output.|Continuous|
|WEIGHT|Integer|>0|Car weight, in thousands of pounds.|Continuous|
|ACCEL|Integer|>0|Acceleration: how quickly the car can reach 60 mph from a standstill (0 mph).|Continuous|
|MPG|Integer|>0|Miles per gallon.|Response variable|


<a id="methods"></a>
## 3 - Methods

<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Methods (<b>30 points total</b>)</h3>
</div>

Methods

<a id="results"></a>
## 4 - Results

<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Results (<b>30 points total</b>)</h3>
</div>

![ANOVA with No Imputation](imgs/anova_initial.png "ANOVA with No Imputation")
<p style='text-align: center;'>
Figure 1: ANOVA Table with No Imputation
</p>

![Fit with No Imputation](imgs/fit_initial.png "Fit with No Imputation")
<p style='text-align: center;'>
Figure 2: Fit with No Imputation
</p>

![Residual with No Imputation](imgs/residuals_initial.png "Residual with No Imputation")
<p style='text-align: center;'>
Figure 3: Residual with No Imputation
</p>

![Parameters Initial](imgs/parameters_initial.png "Parameters with No Imputation")
<p style='text-align: center;'>
Figure 4: Parameters with No Imputation
</p>

![ANOVA Imputation Iteration Five](imgs/anova_im5.png "ANOVA Table with Imputation Iteration Five")
<p style='text-align: center;'>
Figure 5: ANOVA Table with Imputation Iteration Five
</p>

![Fit with Imputation](imgs/fit_combined.png "Fit with Imputation")
<p style='text-align: center;'>
Figure 6: Fit with Imputation
</p>

![Residual with Imputation](imgs/residuals_combined.png "Residual with Imputation")
<p style='text-align: center;'>
Figure 7: Residual with Imputation
</p>

![Parameters Combined](imgs/parameters_combined.png "Parameters with Imputation")
<p style='text-align: center;'>
Figure 8: Parameters with Imputation
</p>

|Variable    |Original Estimate|Original Std Error|Combined Estimate|Combined Std Error|Diff Estimate|Diff Std Error|
|:-----------|----------------:|-----------------:|----------------:|-----------------:|------------:|-------------:|
|intercept   | 70.148          |8.038             |69.543           |4.676             |-0.605       |-3.362        |
|cylinders   |-3.334           |1.561             |-2.892           |0.767             | 0.442       |-0.794        |
|size        | 0.0228          |0.032             | 0.031           |0.0217            | 0.008       |-0.010        |
|horsepower  |-0.195           |0.081             |-0.159           |0.0461            | 0.036       |-0.0349       |
|weight      |-0.306           |5.133             |-3.215           |3.740             |-2.909       |-1.393        |
|acceleration|-0.782           |0.583             |-0.722           |0.410             | 0.060       |-0.173        |
|engine type | 6.599           |3.590             | 5.855           |1.580             |-0.744       |-2.010        |

<p style='text-align: center;'>
Table 1: Parameters Comparison. Original = no imputation (listwise deletion); Combined = pooled over the imputed data sets; Diff = Combined minus Original.
</p>
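
The difference columns are simple subtractions (combined value minus original value). As a check, the sketch below recomputes them from the four reported columns. The values are copied from the table above; the data set name `param_compare` and its variable names are hypothetical, and `engtype` stands in for "engine type" so that the name contains no space.

```sas
* Sketch only: recompute the Diff columns from the reported estimates;
* PARAM_COMPARE and its variable names are hypothetical;
data param_compare;
    length variable $ 12;
    input variable $ orig_est orig_se comb_est comb_se;
    diff_est = comb_est - orig_est;   * change in the coefficient estimate;
    diff_se  = comb_se  - orig_se;    * change in the standard error;
    format diff_est diff_se 8.4;
    datalines;
intercept    70.148  8.038  69.543  4.676
cylinders    -3.334  1.561  -2.892  0.767
size          0.0228 0.032   0.031  0.0217
horsepower   -0.195  0.081  -0.159  0.0461
weight       -0.306  5.133  -3.215  3.740
acceleration -0.782  0.583  -0.722  0.410
engtype       6.599  3.590   5.855  1.580
;
run;

title 'Parameters Comparison (Combined minus Original)';
proc print data=param_compare noobs;
run;
title;
```

Every `diff_se` value is negative, meaning each combined standard error is smaller than its listwise-deletion counterpart.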

<a id="conclusion"></a>
## 5 - Conclusion

<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Conclusion (<b>5 points total</b>)</h3>
</div>

Conclusion

<a id="biblio"></a>
## 6 - Bibliography and Citation

<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Bibliography and Citation (<b>5 points total</b>)</h3>
</div>

- [1] "1.3 The Problem: Missing Values in Data Sets," MSDS 7333 Quantifying the World.

- [2] "1.4 Challenges to Missing Data," MSDS 7333 Quantifying the World.

- [3] "1.7 Multiple Imputation," MSDS 7333 Quantifying the World.

<a id="code"></a>
## 7 - Code

<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Code</h3>
</div>


```
/*
MSDS7333-401: Quantifying the World 

Case Study 2
- Use PROC MI to discover the missing value patterns and to decide what MI options to use. (Assume no transformations.)
- Use PROC MI to create multiple imputed data sets.
- Use PROC REG to analyze the multiple data sets while outputting information to be used in MIANALYZE.
- Use PROC MIANALYZE to summarize the imputed analyses.
- Compare these results to the listwise deletion results.
*/

* load dataset;
data cars;
    infile "e:\qtw_02\carmpg_2_2_2_2.csv" firstobs=2 delimiter=',';
    input auto :$23. mpg cylinders size hp weight accel engtype;
run;

* print out entries;
title 'MPG Dataset';
proc print data=cars;
run;

* contents of dataset;
title 'MPG Dataset Contents';
proc datasets;
   contents data=_all_;
run;
quit;

* what data is missing from dataset?;
* use PROC REG with listwise deletion;
title 'Predicting MPG (initial)';
proc reg data=cars;
    model mpg = cylinders size hp weight;
run;
quit;

* is the missing data monotone or non-monotone?;
* the data is non-monotone;
title 'MI Pattern';
ods select misspattern;
proc mi data=cars nimpute=0;
    var mpg cylinders size hp weight;
run;

* create mi data using default MCMC for non-monotone;
title 'MI with MCMC';
proc mi data=cars out=miout seed=35399 nimpute=5;
    var mpg cylinders size hp weight;
run;

* run reg with mi data;
title 'Predicting MPG with MI (final)';
proc reg data=miout outest=outreg covout;
    model mpg = cylinders size hp weight;
    by _Imputation_;
run;

* combine results;
title 'Predicting MPG (combined)';
proc mianalyze data=outreg;
    modeleffects Intercept cylinders size hp weight;
run;
```
