# Feature Engineering Exercise

In [3]:
import pandas as pd

import os

In this exercise, we will explore different feature engineering ideas.

The feature engineering process can be described as follows:
1. Explore your data and answer key questions about it
  - What is the format of the data?
  - How much data do we have?
  - Describe the data: what columns do we have for each record, and what values do they contain?
  - Where is the data from?
  - What problem are we trying to solve?
  - __What features might we need to solve this problem?__
2. Form a hypothesis and gather additional data if needed
  - Conduct interviews if needed to gain domain knowledge
3. Experiment with potential features and understand correlations between them
4. Review the features with your audience and validate your domain knowledge
5. Build machine learning models and refine your features
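
The questions in step 1 can be answered with a few pandas calls. This is a minimal sketch, assuming a small hand-built frame `sample` (a hypothetical name) that copies a few columns from the first five rows of the county data shown in the `df.head()` output; on the real data the same calls would run on `df`.

```python
import pandas as pd

# Hypothetical stand-in: a few columns from the first five rows of the county data.
sample = pd.DataFrame({
    "State": ["Alabama"] * 5,
    "County": ["Autauga", "Baldwin", "Barbour", "Bibb", "Blount"],
    "% Smokers": [19, 17, 22, 20, 20],
    "% Obese": [38, 31, 44, 38, 34],
    "% LBW": [8.0, 8.0, 11.0, 11.0, 8.0],
})

# How much data do we have?
n_rows, n_cols = sample.shape

# What is the format of each column? Split numeric columns from the rest.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
text_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(n_rows, n_cols)
print("numeric:", numeric_cols)
print("non-numeric:", text_cols)
```

Splitting numeric from non-numeric columns early is useful because the non-numeric ones (here `State` and `County`) need encoding or removal before modeling.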

In [4]:
pwd

'C:\\Users\\deesaw\\Desktop\\MLG-07 USI\\day_1\\code'

In [5]:
df = pd.read_csv('C:\\Users\\deesaw\\Desktop\\MLG-07 USI\\day_1\\data\\2019 County Health Rankings Data - v2.csv')

In [7]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
FIPS,3142.0,30383.649268,15162.508374,1001.0,18177.5,29176.0,45080.5,56045.0
Years of Potential Life Lost Rate,2908.0,8522.752063,2636.067751,2900.0,6697.5,8164.5,10045.5,29783.0
% Fair/Poor,3142.0,17.478994,4.708373,8.0,14.0,17.0,20.0,41.0
Physically Unhealthy Days,3142.0,3.921992,0.714785,2.3,3.4,3.9,4.4,7.2
Mentally Unhealthy Days,3142.0,3.931604,0.614499,2.4,3.5,3.9,4.3,6.0
% LBW,3035.0,8.111697,2.070856,3.0,7.0,8.0,9.0,26.0
% Smokers,3142.0,17.867919,3.667239,7.0,15.0,17.0,20.0,43.0
% Obese,3142.0,32.12508,4.599871,14.0,29.0,32.0,35.0,50.0
Food Environment Index,3123.0,7.465738,1.165679,0.0,6.9,7.7,8.2,10.0
% Physically Inactive,3142.0,25.751432,5.182344,8.0,22.0,26.0,29.0,45.0


In [8]:
df.head()

Unnamed: 0,FIPS,State,County,Years of Potential Life Lost Rate,% Fair/Poor,Physically Unhealthy Days,Mentally Unhealthy Days,% LBW,% Smokers,% Obese,...,Violent Crime Rate,# Injury Deaths,Injury Death Rate,Average Daily PM2.5,% Severe Housing Problems,Severe Housing Cost Burden,Overcrowding,% Drive Alone,# Workers who Drive Alone,% Long Commute - Drives Alone
0,1001,Alabama,Autauga,8824.0,18,4.2,4.3,8.0,19,38,...,272.0,205.0,74.0,11.7,15,13,2,86,20911,38
1,1003,Alabama,Baldwin,7225.0,18,4.1,4.2,8.0,17,31,...,204.0,708.0,69.0,10.3,14,13,1,85,74415,41
2,1005,Alabama,Barbour,9586.0,26,5.1,4.6,11.0,22,44,...,414.0,96.0,73.0,11.5,15,14,2,83,7242,34
3,1007,Alabama,Bibb,11784.0,20,4.4,4.3,11.0,20,38,...,89.0,113.0,100.0,11.2,11,11,0,86,6930,49
4,1009,Alabama,Blount,10908.0,21,4.5,4.7,8.0,20,34,...,483.0,304.0,105.0,11.7,10,8,2,87,18426,60


In [9]:
df_copy=df.copy()

In [10]:
df_copy.head()

Unnamed: 0,FIPS,State,County,Years of Potential Life Lost Rate,% Fair/Poor,Physically Unhealthy Days,Mentally Unhealthy Days,% LBW,% Smokers,% Obese,...,Violent Crime Rate,# Injury Deaths,Injury Death Rate,Average Daily PM2.5,% Severe Housing Problems,Severe Housing Cost Burden,Overcrowding,% Drive Alone,# Workers who Drive Alone,% Long Commute - Drives Alone
0,1001,Alabama,Autauga,8824.0,18,4.2,4.3,8.0,19,38,...,272.0,205.0,74.0,11.7,15,13,2,86,20911,38
1,1003,Alabama,Baldwin,7225.0,18,4.1,4.2,8.0,17,31,...,204.0,708.0,69.0,10.3,14,13,1,85,74415,41
2,1005,Alabama,Barbour,9586.0,26,5.1,4.6,11.0,22,44,...,414.0,96.0,73.0,11.5,15,14,2,83,7242,34
3,1007,Alabama,Bibb,11784.0,20,4.4,4.3,11.0,20,38,...,89.0,113.0,100.0,11.2,11,11,0,86,6930,49
4,1009,Alabama,Blount,10908.0,21,4.5,4.7,8.0,20,34,...,483.0,304.0,105.0,11.7,10,8,2,87,18426,60


In [12]:
df['State'].value_counts()

Texas                   254
Georgia                 159
Virginia                133
Kentucky                120
Missouri                115
Kansas                  105
Illinois                102
North Carolina          100
Iowa                     99
Tennessee                95
Nebraska                 93
Indiana                  92
Ohio                     88
Minnesota                87
Michigan                 83
Mississippi              82
Oklahoma                 77
Arkansas                 75
Wisconsin                72
Alabama                  67
Pennsylvania             67
Florida                  67
South Dakota             66
Louisiana                64
Colorado                 64
New York                 62
California               58
Montana                  56
West Virginia            55
North Dakota             53
South Carolina           46
Idaho                    44
Washington               39
Oregon                   36
New Mexico               33
Alaska              

In [None]:
import seaborn as sns
sns.pairplot(df)

KeyboardInterrupt: 
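
Calling `sns.pairplot` on the full frame draws one panel per pair of numeric columns, which gets slow on a wide table. A cheaper first look is a correlation matrix over a handful of columns. The sketch below uses a hypothetical five-row `sample` built from the `df.head()` values; the choice of columns and of `Years of Potential Life Lost Rate` as the reference column are assumptions for illustration, and five rows are far too few to draw conclusions from.

```python
import pandas as pd

# Hypothetical subset: four columns from the first five rows of the county data.
sample = pd.DataFrame({
    "Years of Potential Life Lost Rate": [8824.0, 7225.0, 9586.0, 11784.0, 10908.0],
    "% Fair/Poor": [18, 18, 26, 20, 21],
    "% Smokers": [19, 17, 22, 20, 20],
    "% Obese": [38, 31, 44, 38, 34],
})

# Pairwise Pearson correlations (the default method of DataFrame.corr).
corr = sample.corr()

# Rank the other columns by their correlation with one reference column.
target = "Years of Potential Life Lost Rate"
ranked = corr[target].drop(target).sort_values(ascending=False)
print(ranked)
```

The same short column list can also be passed to `sns.pairplot` through its `vars` argument to keep the plot small.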

In [None]:
df.isnull()

In [None]:
df.isnull().sum()

In [None]:
df.isnull().any()
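
`df.isnull()` returns a Boolean frame with the same shape as `df`, so it is usually reduced before reading. Here is a small sketch that reports the count and share of missing values per column. The frame `sample` and its gaps are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values; NaN marks a missing measurement.
sample = pd.DataFrame({
    "FIPS": [1001, 1003, 1005, 1007],
    "% LBW": [8.0, np.nan, 11.0, 11.0],
    "Food Environment Index": [7.2, np.nan, np.nan, 7.6],
})

# isnull().sum() counts missing cells per column; isnull().mean() gives the share.
missing = pd.DataFrame({
    "n_missing": sample.isnull().sum(),
    "share_missing": sample.isnull().mean(),
}).sort_values("n_missing", ascending=False)

# Columns that need imputation or removal before modeling.
cols_with_gaps = missing.index[missing["n_missing"] > 0].tolist()

print(missing)
print(cols_with_gaps)
```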

In [16]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# One-hot encode column index 1 ('State'); all other columns pass through unchanged.
# Note: scikit-learn 1.2+ renames the `sparse` argument to `sparse_output`.
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(sparse=False, handle_unknown='ignore'), [1])], remainder='passthrough')
X = ct.fit_transform(df)


## Problem: _Predict a County's Health Ranking_

In [19]:
X

array([[1.0, 0.0, 0.0, ..., 86, 20911, 38],
       [1.0, 0.0, 0.0, ..., 85, 74415, 41],
       [1.0, 0.0, 0.0, ..., 83, 7242, 34],
       ...,
       [0.0, 0.0, 0.0, ..., 77, 7262, 18],
       [0.0, 0.0, 0.0, ..., 77, 2845, 11],
       [0.0, 0.0, 0.0, ..., 73, 2437, 22]], dtype=object)
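
The array above no longer carries column names, and its `dtype=object` is most likely because the passthrough columns still include the text column `County`. As an alternative sketch (not the notebook's method), `pd.get_dummies` one-hot encodes `State` while keeping labelled columns. The frame `sample` is hypothetical: the first two rows come from the `df.head()` output, and the third row is invented to provide a second state.

```python
import pandas as pd

sample = pd.DataFrame({
    "FIPS": [1001, 1003, 99999],                  # last row is invented
    "State": ["Alabama", "Alabama", "Texas"],
    "County": ["Autauga", "Baldwin", "Example"],  # "Example" is a placeholder
    "% Smokers": [19, 17, 15],
})

# One indicator column per state, named State_<value>.
encoded = pd.get_dummies(sample, columns=["State"], dtype=float)

# Assumption: FIPS and County are identifiers, not features, so drop them.
features = encoded.drop(columns=["FIPS", "County"])
print(features.columns.tolist())
```

Dropping the identifier columns leaves an all-numeric frame, which is what most estimators expect.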

![image.png](attachment:image.png)

#### Data: Long Term Healthcare Facility Data (8 datasets, 2 years, 4 quarters)

https://www.countyhealthrankings.org/explore-health-rankings/rankings-data-documentation

https://data.medicare.gov/data/archives/home-health-compare

Information for research: https://longtermcare.acl.gov/

#### Assignment
- Brainstorm features (10 min.)
- Discuss features as a pair (15 min.)
- Explore data and research (15 min.)
- Generate features and explore correlations (30 min.)
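
As a starting point for the "generate features" step, here is a hedged sketch of two derived features built from columns that appear in the `df.head()` output. The feature names `est_workers` and `unhealthy_days_total` are hypothetical, the sketch assumes `% Drive Alone` is the share of workers who drive alone, and whether either feature helps predict a ranking is something to test, not a given.

```python
import pandas as pd

# First five rows of the county data, restricted to the columns used below.
sample = pd.DataFrame({
    "County": ["Autauga", "Baldwin", "Barbour", "Bibb", "Blount"],
    "Physically Unhealthy Days": [4.2, 4.1, 5.1, 4.4, 4.5],
    "Mentally Unhealthy Days": [4.3, 4.2, 4.6, 4.3, 4.7],
    "% Drive Alone": [86, 85, 83, 86, 87],
    "# Workers who Drive Alone": [20911, 74415, 7242, 6930, 18426],
})

# Rough workforce size: drive-alone count divided by the drive-alone share.
sample["est_workers"] = (
    sample["# Workers who Drive Alone"] / (sample["% Drive Alone"] / 100)
)

# Combined burden of physically and mentally unhealthy days.
sample["unhealthy_days_total"] = (
    sample["Physically Unhealthy Days"] + sample["Mentally Unhealthy Days"]
)

print(sample[["County", "est_workers", "unhealthy_days_total"]])
```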