<img src="../../predictioNN_Logo_JPG(72).jpg" width=200>

---

## Coding Assignment 4: Data Transformations

### Introduction to Data Science
#### Last Updated: August 21, 2022

---

### Skills Assessed

You will demonstrate these skills in the HW:
- subsetting a pandas dataframe
- working with lists
- scaling data
- binarizing data
- merging data

---

### Instructions

You will show off your data transformation skills in this assignment.

**About the Data**

The dataset is the [2017 Workplace Health in America survey](https://www.cdc.gov/workplacehealthpromotion/survey/data.html), conducted by the Centers for Disease Control and Prevention.  
For details, refer to this [guidance document](https://www.cdc.gov/workplacehealthpromotion/data-surveillance/docs/2017-WHA-Guidance-Document-for-Use-of-Public-Data-files-508.pdf).

The data contains over 300 features including:
- industry of employment
- company type
- type of health insurance programs offered
- whether remote work was allowed
- gender
- age

Follow the framed-out code below, filling in the missing pieces.

**TOTAL POINTS: 10**

---

```python
import warnings
warnings.filterwarnings("ignore")
```

Review the instructions below to figure out which modules are required. Import them all here.

## 1. Data Loading and Prep

The data is stored in a csv file, available from this page: https://www.cdc.gov/workplacehealthpromotion/survey/data.html

1) **(1 PT)** Load the data directly into Python without manually downloading the data onto your hard drive. Show the first five rows of data.

Note: The columns are separated by `~` rather than commas, so you'll want to pass the appropriate separator parameter to your loading function.
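As a minimal sketch of reading `~`-separated data, the example below parses a small in-memory string rather than the real survey file. The string `raw` and the name `demo` are hypothetical stand-ins; only the column names are borrowed from the survey.

```python
import io
import pandas as pd

# Hypothetical stand-in for the remote file: three columns separated by '~'.
raw = "OC1~OC3~HI1\n1~2~1\n2~1~3\n"

# sep tells read_csv which character separates the columns.
# read_csv also accepts a URL string in place of the file-like object used here.
demo = pd.read_csv(io.StringIO(raw), sep="~")
print(demo.head())
```

Because `read_csv` accepts a URL, the same call pattern should work on a remote file without saving it to disk first.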

It's always a good idea to check the data types of the columns. Let's do this:

```python
df.dtypes
```

```
OC1                      object
OC3                     float64
HI1                     float64
HI2                     float64
HI3                     float64
                         ...   
CDC_Region              float64
Industry                float64
Size                    float64
Varstrata               float64
Finalwt_worksite,,,,     object
Length: 301, dtype: object
```

Hmm. `OC1` looks like it should hold integers, but pandas stores it as `object`. Something strange is going on.  
It turns out there is a bad row of data! It's the row with index 1662. Let's look at it:

```python
df.iloc[1662,:]
```

```
OC1                     1~4~2~1~2~1~2~96~96~96~1~1~1~1~1~1~5~3~1~5~96~...
OC3                                                                   NaN
HI1                                                                   NaN
HI2                                                                   NaN
HI3                                                                   NaN
                                              ...                        
CDC_Region                                                            NaN
Industry                                                              NaN
Size                                                                  NaN
Varstrata                                                             NaN
Finalwt_worksite,,,,                                                  NaN
Name: 1662, Length: 301, dtype: object
```

Let's remove this row:

```python
df.drop(1662, inplace=True)
```

2) **(2 PTS)** The data contains 301 columns and you don't need most of them. Create a new dataframe called `work` containing only the columns listed below. Show the first five rows in `work`.

* `Industry`: 7 Industry Categories with NAICS codes

* `Size`: 8 Employee Size Categories

* `OC3`: Is your organization for profit, non-profit, government?

* `HI1`: In general, do you offer full, partial or no payment of premiums for personal health insurance for full-time employees?

* `HI2`: Over the past 12 months, were full-time employees asked to pay a larger proportion, smaller proportion or the same proportion of personal health insurance premiums?

* `HI3`: Does your organization offer personal health insurance for your part-time employees?

* `CP1`: Are there health education programs, which focus on skill development and lifestyle behavior change along with information dissemination and awareness building?

* `WL6`: Allow employees to work from home?

* Every column whose name begins with `WD`. These columns express the percentage of employees at the firm that have certain characteristics.  
  **Hint: see if you can find a function that helps with this.**
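One way to combine named columns with prefix-matched columns is sketched below on a toy dataframe. The frame `toy` and the column names `WD1_1` and `WD2` are hypothetical; check the real column names in `df` before relying on the pattern.

```python
import pandas as pd

# Toy frame; the WD column names here are made up for illustration.
toy = pd.DataFrame({
    "Industry": [1, 2],
    "Size": [3, 4],
    "WD1_1": [10, 20],
    "WD2": [30, 40],
    "HI1": [1, 2],
})

named = ["Industry", "Size"]

# Collect every column whose name starts with the prefix "WD".
wd_cols = [c for c in toy.columns if c.startswith("WD")]
subset = toy[named + wd_cols]

# DataFrame.filter with a regex anchored at the start picks the same WD columns.
same = toy.filter(regex="^WD")

print(subset.head())
```

Building the column list first keeps the selection explicit, which makes it easy to print and verify before subsetting.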

3) **(1 PT)** Next, let's check whether the variables are on different scales. Compute the minimum, maximum, standard deviation and mean of each variable.  
It is fine to print additional statistics as well.
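As an illustration of computing several summary statistics at once, here is a small sketch on made-up numbers. The frame `toy` is hypothetical, and this is only one of several ways to get these values.

```python
import pandas as pd

# Two toy columns on clearly different scales.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 60.0]})

# agg with a list of function names returns one row per statistic.
stats = toy.agg(["min", "max", "std", "mean"])
print(stats)

# describe() reports these statistics plus the count and quartiles.
print(toy.describe())
```

Note that pandas computes the sample standard deviation (`ddof=1`) by default, which differs slightly from numpy's default population version.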

4) **(1 PT)** Briefly explain a benefit of scaling data.

5) **(1 PT)** It appears that the ranges of some of the variables are very different. Let's standardize the data.  
Save the scaled data in `work_scaled` and print it.

Notice that `work` is a dataframe, but `work_scaled` is an array. This might be surprising!  
Recall that dataframes store data as numpy arrays as well.  
**Changing data types can be a source of confusion and bugs. It's good practice to check the data type and dimension of objects.**
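To illustrate the type change, here is a minimal sketch on a toy dataframe. It standardizes by hand with numpy, which is an assumption made to keep the example self-contained and is not necessarily the scaler the assignment intends. The names `toy`, `values`, and `scaled` are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 60.0]})

# Pull out the underlying values as a numpy array (labels are dropped).
values = toy.to_numpy()

# z-score each column: subtract the column mean, divide by the column's
# population standard deviation (numpy's default, ddof=0).
scaled = (values - values.mean(axis=0)) / values.std(axis=0)

# Check the type and dimensions before and after the transformation.
print(type(toy), toy.shape)
print(type(scaled), scaled.shape)
```

scikit-learn's `StandardScaler` applies the same column-wise formula and, by default, also returns a numpy array rather than a dataframe.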

```python
type(work_scaled)
```

```
numpy.ndarray
```

This confirms that `work_scaled` is a numpy array.

6) **(1 PT)** Let's make sure the shape of `work_scaled` matches the shape of `work`. Print the dimensions (rows, columns) of each object.

7) **(1 PT)** Let's make sure the scaled variables have standard deviation close to one. Print their standard deviations.  
Hint: since `work_scaled` is a numpy array, you'll want to import a package to do the computation.  
Also, there are 16 variables, so you should see 16 standard deviations in an array.
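The role of the `axis` argument can be sketched on a small made-up array. The array `arr` is hypothetical; the point is only that `axis=0` yields one standard deviation per column, while omitting it yields a single number.

```python
import numpy as np

arr = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 60.0]])

# axis=0 computes down the rows, giving one value per column.
col_std = np.std(arr, axis=0)

# Without axis, the array is flattened and a single scalar is returned.
all_std = np.std(arr)

print(col_std.shape)
print(col_std)
```

For an array with 16 columns, the `axis=0` form would return an array of length 16.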

8) **(1 PT)** Next, you would like to binarize the variable `CP1` in the dataframe `work`.  
Specifically, create a new variable in the `work` dataframe named `CP1_bin`, where:  

```
CP1_bin = 1 if CP1 >= 2
CP1_bin = 0 if CP1 < 2
```

Print the first five rows of the dataframe to show the new column.
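A threshold rule like this can be sketched with `np.where` on a toy column. The frame `toy` and its values are made up; the missing value is included only to show how the comparison treats it.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"CP1": [1.0, 2.0, 3.0, np.nan]})

# np.where(condition, value_if_true, value_if_false), applied element-wise.
# A comparison against NaN evaluates to False, so the missing value maps to 0 here.
toy["CP1_bin"] = np.where(toy["CP1"] >= 2, 1, 0)

print(toy.head())
```

Whether missing values should really become 0 depends on the analysis, so it is worth checking for them before binarizing.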

---

Next, `CP1_bin` is converted to an array (specifically, a column vector):

```python
cpi_bin = work.CP1_bin.values.reshape(-1,1)
cpi_bin
```

9) **(1 PT)** As a final step, you'd like to merge the two numpy arrays `work_scaled` and `cpi_bin` into a new array called `final`.  
This will make it easier to do machine learning on the data later. Merge these arrays together and print out the result.  
Hint: consider the function `hstack()`
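How `hstack()` joins arrays side by side can be sketched on two small made-up arrays. The names `left`, `right`, and `combined` are hypothetical and the values are arbitrary.

```python
import numpy as np

left = np.array([[0.1, 0.2],
                 [0.3, 0.4],
                 [0.5, 0.6]])               # shape (3, 2)

# reshape(-1, 1) turns a flat vector into a column with one entry per row.
right = np.array([1, 0, 1]).reshape(-1, 1)  # shape (3, 1)

# hstack places the arrays next to each other; the row counts must match.
combined = np.hstack([left, right])

print(combined.shape)
print(combined)
```

The reshape step matters: stacking a two-dimensional array with a flat one-dimensional vector raises an error because the number of dimensions differs.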

---

Print the shapes of: `work_scaled`, `cpi_bin`, `final`

`work_scaled` should have 16 columns  
`cpi_bin` should have 1 column  
`final` should have 16 + 1 columns.

```python
print(work_scaled.shape)
print(cpi_bin.shape)
print(final.shape)
```

```
(2842, 16)
(2842, 1)
(2842, 17)
```


If you completed this notebook, you have made substantial progress in the course. Congrats!  
You demonstrated that you can load, prepare, scale, binarize and merge data.  
This gets you very close to training a model on data.

---