# Linear models case study

- [Linear models case study](#linear-models-case-study)
  - [Topic Choices](#topic-choices)
    - [Choice 1: Forecast HIV incidence in US counties](#choice-1-forecast-hiv-incidence-in-us-counties)
      - [Major challenges](#major-challenges)
    - [Choice 2: Predict the sale price of heavy equipment at auction](#choice-2-predict-the-sale-price-of-heavy-equipment-at-auction)
      - [Major challenges](#major-challenges-1)
  - [Deliverables](#deliverables)

## Topic Choices

Depending on your campus (check with your instructors), you and your group have 
two options for this case study.


### Choice 1: [Forecast HIV incidence in US counties](forecast_HIV_infections/README.md)

**Using data merged from several databases, you are asked to build a model that
predicts HIV incidence for US counties.**  You should also identify and report
the most significant drivers of HIV infection and how they vary between counties.

Read more [here](./forecast_HIV_infections/README.md).

#### Major challenges
1. Feature distributions are non-normal, so you will likely need thoughtful, domain-informed feature engineering, data preprocessing, or feature transformation.
2. You must split the data into training and test sets manually.
3. The data come from several sources, so read the documentation carefully to understand exactly what each field means.
4. Be aware of possible data leakage.
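
The manual split mentioned in challenge 2 can be sketched as follows. This is a minimal illustration using only the standard library; the function name, the `county_id` field, the 20% test fraction, and the seed are all assumptions made for the example, not part of the provided data or notebook.

```python
import random


def train_test_split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    shuffled = list(rows)                      # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)      # fixed seed gives a repeatable split
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical rows standing in for one record per county.
county_rows = [{"county_id": i} for i in range(10)]
train_rows, test_rows = train_test_split_rows(county_rows)
```

Splitting once, before any exploration or feature engineering, and leaving the test rows alone until the end is one way to reduce the risk of data leakage noted in challenge 4.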

### Choice 2: [Predict the sale price of heavy equipment at auction](predict_auction_price/README.md)

**Predict the sale price of a particular piece of heavy equipment at auction based
on its usage, equipment type, and configuration.**  The data is sourced from auction
result postings and includes information on usage and equipment configurations.

Read more [here](./predict_auction_price/README.md).

#### Major challenges
1. You need to learn to use a script to evaluate your results.
2. There are more instructions to follow than in [Choice 1: Forecast HIV incidence in US counties](#choice-1-forecast-hiv-incidence-in-us-counties).
3. The data set has a lot of missing data.
4. Be aware of possible data leakage.

## Deliverables

At the end of the day your group will be expected to present for 5-10
minutes on your findings.  Present results from your README.md.

Cover the following in your presentation.

   1. Talk about what you planned to accomplish
   2. How you organized yourselves as a team (including your git workflow)
   3. Description of the problem and the data
   4. What you accomplished (how you chose model, performance metric, validation)
   5. Performance on unseen data
   6. Anything new you learned along the way


# Predict Heavy Equipment Auction Price

- [Predict Heavy Equipment Auction Price](#predict-heavy-equipment-auction-price)
  - [Case Study Goal](#case-study-goal)
    - [Evaluation Metrics](#evaluation-metrics)
  - [Backgrounds](#backgrounds)
  - [Data](#data)
  - [Overview of the `loss_model.py` script](#overview-of-the-loss_modelpy-script)
  - [Credit](#credit)
  - [Appendix](#appendix)
    - [Important Tips](#important-tips)
    - [Restrictions](#restrictions)

## Case Study Goal
Predict the `sale price` of a particular piece of heavy equipment at auction, based
on its usage, equipment type, and configuration.  The data is sourced from auction
result postings and includes information on usage and equipment configurations.

### Evaluation Metrics
The evaluation of your model will be based on Root Mean Squared Log Error (RMSLE),
which is computed as follows:

![Root Mean Squared Logarithmic Error](images/rmsle.png)

where *p<sub>i</sub>* are the predicted values (predicted auction sale prices) 
and *a<sub>i</sub>* are the actual values (the actual auction sale prices).

Note that this loss function is sensitive to the *ratio* of predicted values to
the actual values: a prediction of 200 for an actual value of 100 contributes
approximately the same amount to the loss as a prediction of 2000 for an actual
value of 1000.  To convince yourself of this, recall that a difference of
logarithms is equal to a single logarithm of a ratio, and rewrite each summand
as a single logarithm of a ratio.

This loss function is implemented in [`loss_model.py`](./loss_model.py). Read it to understand how it works.
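
As a quick check of the ratio property, here is a minimal sketch of RMSLE in plain Python. It assumes the common form of the metric that adds 1 inside each logarithm (`log1p`); the formula image is not reproduced here, so treat `loss_model.py` as the reference and this only as an illustration. The function name `rmsle` is chosen for the example.

```python
import math


def rmsle(predicted, actual):
    """Root mean squared log error, assuming the log(x + 1) form of the metric."""
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for p, a in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Both predictions are off by a factor of two, at very different price scales.
small = rmsle([200], [100])
large = rmsle([2000], [1000])
```

Under this assumption both `small` and `large` come out close to log 2 (roughly 0.69), which is the sense in which the loss depends on the ratio rather than the absolute difference.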

## Backgrounds

Please check the original Kaggle contest [page](https://www.kaggle.com/c/bluebook-for-bulldozers).

## Data
The data for this case study are in `./data`. Although there are both training
and testing data sets, the testing data set [test.csv](./data/test.csv) should **only** be used to evaluate
your final model performance at the end of the day. Think of it as your
hold-out set; its ground truth is in [ground_truth/test_actual.csv](./data/ground_truth/test_actual.csv).  Use cross-validation on the training data sets [train_1.csv](./data/train_1.csv) and [train_2.csv](./data/train_2.csv) to identify your
best model, and report that model's performance on the test data at the end of the day.

> Hint: use `cat` to concatenate `train_1.csv` and `train_2.csv` to a single `train.csv` file.
  ```bash
  cat train_1.csv train_2.csv > train.csv
  ```

By using the same test data and the same evaluation metric (RMSLE), the relative
performance of different groups' models on this case study can be assessed.

A [data_dictionary.csv](./data/data_dictionary.csv) is included that explains the columns in the data.
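
The cross-validation advice in this section can be sketched with a small k-fold index generator. This is a standard-library illustration under stated assumptions: the function name, the choice of 5 folds, and the seed are made up for the example, and a library routine would normally be used instead.

```python
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs; each row is in exactly one validation fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)        # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]   # k roughly equal-sized folds
    for i, valid_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, valid_idx


# Ten rows stand in for the concatenated training data.
splits = list(k_fold_indices(n_rows=10, k=5))
```

Each candidate model would be fit on `train_idx` and scored with RMSLE on `valid_idx`, with the average across folds used to compare models, so that `test.csv` stays untouched until the final evaluation.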

## Overview of the `loss_model.py` script
The included `loss_model.py` script scores your predictions for the test set against the provided hold-out ground truth.  This follows a common setup in competitions such as Kaggle, where this problem came from: there is a labeled training set for modeling and feature tuning, and a hold-out test set to compare your predictions against.

You will need to fit a model on the training data and generate a prediction for every row in the test set.  Then create a CSV file containing the fields `SalesID` and `SalePrice` (the names must match exactly). An example file, [Example_Output.csv](./data/Example_Output.csv), is provided. The path to this CSV file is the argument you pass to the script.

Run it in a terminal:

```bash
python loss_model.py <path to csv file>
```

For example, the following command returns `0.7802091986822471`:
```bash
python loss_model.py data/Example_Output.csv
```
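
The submission file described in this section could be produced with a short script like the sketch below. Only the column names `SalesID` and `SalePrice` come from the instructions above; the function name, the output location, and the two example rows are hypothetical placeholders.

```python
import csv
import tempfile
from pathlib import Path


def write_submission(path, sales_ids, predicted_prices):
    """Write predictions with the exact header the loss script expects."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["SalesID", "SalePrice"])   # names must match exactly
        writer.writerows(zip(sales_ids, predicted_prices))


# Hypothetical output location and made-up IDs and prices, for illustration only.
output_path = Path(tempfile.gettempdir()) / "my_submission.csv"
write_submission(output_path, [1001, 1002], [25000.0, 31500.0])
```

The resulting path is what you would pass to `python loss_model.py`, in the same way as `data/Example_Output.csv` above.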



## Credit
This case study is based on [Kaggle's Blue Book for Bulldozers](https://www.kaggle.com/c/bluebook-for-bulldozers) competition.  The best RMSLE was only 0.23 (obviously lower is better).
>  Note that if you were to simply guess the median auction price for all the pieces of equipment in the test set you would get an RMSLE of about 0.7.

## Appendix

### Important Tips

1. This data is quite messy. Try to use your judgement about where your
cleaning efforts will yield the most results and focus there first.
2. Because of the restriction to linear models, you will have to carefully
consider how to transform continuous predictors in your model.
3. Remember any transformations you apply to the training data will also have
to be applied to the testing data, so plan accordingly.
4. Any transformations of the training data that *learn parameters* (for
example, standardization learns the mean and variance of a feature) must only
use parameters learned from the *training data*.
5. It's possible some columns in the test data will take on values not seen in
the training data. Plan accordingly.
6. Use your intuition to *think about where the strongest signal about a price
is likely to come from*. If you weren't fitting a model, but were asked to use
this data to predict a price, what would you do? Can you combine the model with
your intuitive instincts?  This is important because it can be done *without
looking at the data*; thinking about the problem has no risk of overfitting.
7. Start simply. Fit a basic model and make sure you're able to get the submission
working, then iterate to improve.
8. Remember that you are evaluated on a loss function that is only sensitive to
the *ratios* of predicted to actual values.  It's almost certainly too much of
a task to implement an algorithm that minimizes this loss function directly in
the time you have, but there are some steps you can take to do a good job of
it.
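
Tips 3, 4, and 8 can be illustrated together in one small sketch. It assumes a single made-up predictor (equipment age) and toy prices invented for the example, standardizes the predictor with parameters learned from the training rows only, and fits ordinary least squares to `log1p(price)`. Fitting on the log of the target is one common way to approximate minimizing RMSLE, assuming the `log(x + 1)` form of the metric; it is offered as a suggestion, not as the required approach.

```python
import math
from statistics import mean, pstdev


def fit_standardizer(values):
    """Learn the mean and standard deviation from training values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0    # fall back to 1.0 for a constant column
    return mu, sigma


def standardize(values, mu, sigma):
    """Apply previously learned parameters to any data set."""
    return [(v - mu) / sigma for v in values]


def fit_simple_ols(x, y):
    """Closed-form least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Toy numbers, invented for illustration only.
train_age = [2.0, 5.0, 8.0, 12.0, 20.0]
train_price = [60000.0, 42000.0, 30000.0, 21000.0, 9500.0]
test_age = [4.0, 15.0]

mu, sigma = fit_standardizer(train_age)
x_train = standardize(train_age, mu, sigma)
x_test = standardize(test_age, mu, sigma)    # reuse the training parameters

# Fit on the log scale, then map predictions back to prices.
log_price = [math.log1p(p) for p in train_price]
intercept, slope = fit_simple_ols(x_train, log_price)
predictions = [math.expm1(intercept + slope * x) for x in x_test]
```

Because the model is linear in log price, squared error on the transformed target lines up with the squared log error in the metric, and `expm1` undoes the transform before the predictions are written to the submission file.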

### Restrictions
Please use only *regression* methods for this case study.  The following techniques
are allowed.

  - Linear Regression.
  - Logistic Regression.
  - Median Regression (linear regression by minimizing the sum of absolute deviations).
  - Any other [GLM](http://statsmodels.sourceforge.net/devel/glm.html).
  - Regularization: Ridge and LASSO.

You may use other models or algorithms as supplements (for example, in feature
engineering), but the loss you report for your final submission must come from
a linear-type model.


# Forecasting-HIV-Infections Case Study

- [Forecasting-HIV-Infections Case Study](#forecasting-hiv-infections-case-study)
  - [Case Study Goal](#case-study-goal)
  - [Background](#background)
  - [Data](#data)
    - [Data merging](#data-merging)
  - [Credit](#credit)

## Case Study Goal
1)	Accurately model HIV `incidences` (new infections per 100,000) in US
counties by building a linear regression model that uses HIV infection data, census data, data on the opioid crisis, and data on sexual orientation.

2)	Identify the features that are the most significant drivers of HIV infection rates and learn how these drivers differ between regions.

## Background
Due to the development of anti-retroviral therapies, the HIV/AIDS epidemic is
generally considered to be under control in the US.  However, as of 2015 there
were 971,524 people living with diagnosed HIV in the US, and an estimated
37,600 new HIV diagnoses in 2014.  HIV infection rates continue to be particularly
problematic in communities of color, among men who have sex with men (MSM), the
transgender community, and other vulnerable populations in the US. Socioeconomic
factors are a significant risk factor for HIV infection and likely contribute
to HIV infection risk in these communities.  The current US opioid crisis has
further complicated efforts to combat HIV, with HIV infection outbreaks now
hitting regions that weren’t previously thought to be vulnerable to such outbreaks.

A model that can accurately forecast regional HIV infection rates would be 
beneficial to local public health officials.  Provided with this information, 
these officials will be able to better marshal the resources necessary to combat
HIV and prevent outbreaks from occurring.  Accurate modeling will also identify 
risk factors for communities with high HIV infection rates and provide clues 
as to how officials may better combat HIV in their respective communities.


## Data


The `./data` folder contains data from three publicly available sources.  Groups should feel
free to supplement this data if they wish.
1. The largest collection of HIV and opioid data was obtained from the [opioid database](http://opioid.amfar.org/) maintained by the American Foundation for AIDS Research (amfAR).
2. Demographic and economic data were obtained from the 5-year American Community Survey, which is available at the [US Census Bureau website](https://factfinder.census.gov/faces/nav/jsf/pages/searchresults.xhtml?refresh=t).
3. Estimates for the [MSM population](http://emorycamp.org/item.php?i=48) in each county were obtained from the Emory Coalition for Applied Modeling for Prevention (CAMP).

Data dictionaries that indicate what each column in the data means are included in the folder associated with each data set.

### Data merging

The `merge_data.ipynb` notebook reads and merges most of the data in the 
`data` folder into one dataframe.  Read through and execute this notebook cell-by-cell to
better understand the data and bring it together for EDA.


## Credit
This case study is based on [Eric Logue's capstone project](https://github.com/elogue01/Forecasting-HIV-Infections).  
You may wish to consult his GitHub repository devoted to this analysis for inspiration and insight.
