In [1]:
%logstop
%logstart -rtq ~/.logs/ml.py append
import seaborn as sns
sns.set()

In [4]:
from static_grader import grader

# ML Miniproject
## Introduction

The objective of this miniproject is to exercise your ability to create effective machine learning models for making predictions. We will be working with nursing home inspection data from the United States, predicting which providers may be fined and for how much.

## Scoring

In this miniproject you will often submit your model's `predict` or `predict_proba` method to the grader. The grader will assess the performance of your model using a scoring metric, comparing it against the score of a reference model. We will use the [average precision score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html). If your model performs better than the reference solution, then you can score higher than 1.0.
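As a rough illustration of the metric, the sketch below computes average precision on four toy predictions and divides it by a made-up reference value. The toy labels, the `reference_ap` value, and the division itself are assumptions for illustration; the grader's actual reference score and normalization are not shown in this notebook.

```python
from sklearn.metrics import average_precision_score

# Toy labels and predicted probabilities (illustrative values, not course data).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

model_ap = average_precision_score(y_true, y_score)

# Hypothetical reference score; the grader's real reference value is unknown here.
reference_ap = 0.75

# Assumed normalization: model score divided by the reference score,
# so a model that beats the reference ends up above 1.0.
relative_score = model_ap / reference_ap
print(f"average precision: {model_ap:.3f}, relative score: {relative_score:.3f}")
```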

**Note:** If you use an estimator that relies on random draws (like a `RandomForestClassifier`) you should set the `random_state=` to an integer so that your results are reproducible. 
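A small sketch of what fixing `random_state` buys you, using synthetic data from `make_classification` (not the nursing home data). The helper name `fit_and_predict` and the seed value are arbitrary choices for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, generated with a fixed seed.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

def fit_and_predict(seed):
    """Fit a small forest with the given seed and return positive-class probabilities."""
    model = RandomForestClassifier(n_estimators=20, random_state=seed)
    model.fit(X_demo, y_demo)
    return model.predict_proba(X_demo)[:, 1]

# Two independent fits with the same integer seed should give identical output.
first_run = fit_and_predict(42)
second_run = fit_and_predict(42)
print(np.array_equal(first_run, second_run))
```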

## Downloading the data

We can download the data set from Amazon S3:

In [5]:
%%bash
mkdir -p ml-data
wget http://dataincubator-wqu.s3.amazonaws.com/mldata/providers-train.csv -nc -P ./ml-data
wget http://dataincubator-wqu.s3.amazonaws.com/mldata/providers-metadata.csv -nc -P ./ml-data

File ‘./ml-data/providers-train.csv’ already there; not retrieving.

File ‘./ml-data/providers-metadata.csv’ already there; not retrieving.



We'll load the data into a Pandas DataFrame. Several columns will become target labels in future questions. Let's pop those columns out from the data, and drop related columns that are neither targets nor reasonable features (e.g., we wouldn't know how many times a facility was denied payment before knowing whether it was fined).

The data has many columns. We have also provided a data dictionary.
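To make the "pop" step concrete, here is a minimal sketch on a three-row toy frame (the values are illustrative). `DataFrame.pop` removes the named column in place and returns it as a Series, which is how a target can be separated from the features.

```python
import pandas as pd

# Toy frame with one target-like column; not the real provider data.
toy = pd.DataFrame({
    'STATE': ['AL', 'TX', 'TX'],
    'BEDCERT': [85, 128, 100],
    'FINE_TOT': [15259, 0, 0],
})

# pop() removes the column from `toy` and returns it.
toy_target = toy.pop('FINE_TOT')

print(list(toy.columns))
print(toy_target.tolist())
```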

In [6]:
import numpy as np
import pandas as pd

In [7]:
metadata = pd.read_csv('./ml-data/providers-metadata.csv')
metadata.head()

Unnamed: 0,Variable,Label,Description,Format
0,PROVNUM,Federal Provider Number,Federal Provider Number,6 alphanumeric characters
1,PROVNAME,Provider Name,Provider Name,text
2,ADDRESS,Provider Address,Provider Address,text
3,CITY,Provider City,Provider City,text
4,STATE,Provider State,Provider State,2-character postal abbreviation


In [8]:
data = pd.read_csv('./ml-data/providers-train.csv', encoding='latin1')

fine_counts = data.pop('FINE_CNT')
fine_totals = data.pop('FINE_TOT')
cycle_2_score = data.pop('CYCLE_2_TOTAL_SCORE')

In [9]:
data.head()

Unnamed: 0,PROVNUM,PROVNAME,ADDRESS,CITY,STATE,ZIP,PHONE,COUNTY_SSA,COUNTY_NAME,BEDCERT,...,CERTIFICATION,CYCLE_1_DEFS,CYCLE_1_NFROMDEFS,CYCLE_1_NFROMCOMP,CYCLE_1_DEFS_SCORE,CYCLE_1_NUMREVIS,CYCLE_1_REVISIT_SCORE,CYCLE_1_TOTAL_SCORE,CYCLE_1_SURVEY_DATE,CYCLE_2_SURVEY_DATE
0,15010,COOSA VALLEY NURSING FACILITY,315 WEST HICKORY STREET,SYLACAUGA,AL,35150,2562495604,600,Talladega,85,...,Medicare and Medicaid,7,7,0,36,1,0,36,2017-04-06,2016-05-26
1,15012,HIGHLANDS HEALTH AND REHAB,380 WOODS COVE ROAD,SCOTTSBORO,AL,35768,2562183708,350,Jackson,50,...,Medicare and Medicaid,5,5,0,44,1,0,44,2017-03-16,2016-02-04
2,15014,EASTVIEW REHABILITATION & HEALTHCARE CENTER,7755 FOURTH AVENUE SOUTH,BIRMINGHAM,AL,35206,2058330146,360,Jefferson,92,...,Medicare and Medicaid,6,6,0,40,1,0,40,2016-10-20,2015-12-30
3,15015,PLANTATION MANOR NURSING HOME,6450 OLD TUSCALOOSA HIGHWAY P O BOX 97,MC CALLA,AL,35111,2054776161,360,Jefferson,103,...,Medicare and Medicaid,2,2,0,16,1,0,16,2017-03-09,2016-02-11
4,15016,ATHENS HEALTH AND REHABILITATION LLC,611 WEST MARKET STREET,ATHENS,AL,35611,2562321620,410,Limestone,149,...,Medicare and Medicaid,2,2,0,20,1,0,20,2017-06-01,2016-05-12


## Question 1: state_model

A federal agency, Centers for Medicare and Medicaid Services (CMS), imposes regulations on nursing homes. However, nursing homes are inspected by state agencies for compliance with regulations, and fines for violations can vary widely between states.

Let's develop a very simple initial model to predict the amount of fines a nursing home might expect to pay based on its location. Fill in the class definition of the custom estimator, `StateMeanEstimator`, below.

**Note:** When the grader checks your answer, it passes a list of dictionaries to the `predict` method of your estimator, not a DataFrame. This means that your model must work with both data types. You can handle this by adding a test (and optional conversion) in the `predict` method of your custom class, similar to the `ColumnSelectTransformer` given below in Question 2.  
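One way to handle both input types is a small type check before any column access. The helper below is a sketch; the name `ensure_frame` is hypothetical and not part of the course code.

```python
import pandas as pd

def ensure_frame(X):
    """Return X unchanged if it is a DataFrame, otherwise build one from it."""
    if isinstance(X, pd.DataFrame):
        return X
    # A list of dictionaries becomes one row per dictionary.
    return pd.DataFrame(X)

records = [{'STATE': 'AL', 'BEDCERT': 85}, {'STATE': 'TX', 'BEDCERT': 128}]
frame = ensure_frame(records)
print(frame['STATE'].tolist())
```

The same check can live at the top of a custom `predict` method or in a separate transformer at the start of a `Pipeline`.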

In [10]:
fine_totals

0        15259
1            0
2            0
3            0
4            0
         ...  
13887        0
13888        0
13889        0
13890        0
13891    70865
Name: FINE_TOT, Length: 13892, dtype: int64

In [11]:
data.assign(y=fine_totals)

Unnamed: 0,PROVNUM,PROVNAME,ADDRESS,CITY,STATE,ZIP,PHONE,COUNTY_SSA,COUNTY_NAME,BEDCERT,...,CYCLE_1_DEFS,CYCLE_1_NFROMDEFS,CYCLE_1_NFROMCOMP,CYCLE_1_DEFS_SCORE,CYCLE_1_NUMREVIS,CYCLE_1_REVISIT_SCORE,CYCLE_1_TOTAL_SCORE,CYCLE_1_SURVEY_DATE,CYCLE_2_SURVEY_DATE,y
0,015010,COOSA VALLEY NURSING FACILITY,315 WEST HICKORY STREET,SYLACAUGA,AL,35150,2562495604,600,Talladega,85,...,7,7,0,36,1,0,36,2017-04-06,2016-05-26,15259
1,015012,HIGHLANDS HEALTH AND REHAB,380 WOODS COVE ROAD,SCOTTSBORO,AL,35768,2562183708,350,Jackson,50,...,5,5,0,44,1,0,44,2017-03-16,2016-02-04,0
2,015014,EASTVIEW REHABILITATION & HEALTHCARE CENTER,7755 FOURTH AVENUE SOUTH,BIRMINGHAM,AL,35206,2058330146,360,Jefferson,92,...,6,6,0,40,1,0,40,2016-10-20,2015-12-30,0
3,015015,PLANTATION MANOR NURSING HOME,6450 OLD TUSCALOOSA HIGHWAY P O BOX 97,MC CALLA,AL,35111,2054776161,360,Jefferson,103,...,2,2,0,16,1,0,16,2017-03-09,2016-02-11,0
4,015016,ATHENS HEALTH AND REHABILITATION LLC,611 WEST MARKET STREET,ATHENS,AL,35611,2562321620,410,Limestone,149,...,2,2,0,20,1,0,20,2017-06-01,2016-05-12,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13887,676406,TRUCARE LIVING CENTERS - SELMA,16550 RETAMA PARKWAY,SELMA,TX,78154,2108868393,581,Guadalupe,128,...,13,12,1,88,1,0,88,2017-08-10,2016-10-05,0
13888,676408,THE LODGE AT BEAR CREEK,3729 IRA E WOODS AVENUE,GRAPEVINE,TX,76051,8178098000,910,Tarrant,100,...,2,1,1,20,1,0,20,2017-06-22,2016-07-21,0
13889,676411,CLARENDON NURSING HOME,TEN MEDICAL CENTER DR,CLARENDON,TX,79226,8068745221,431,Donley,61,...,15,11,7,116,1,0,116,2017-09-28,2016-12-14,0
13890,676412,FALL CREEK REHABILITATION AND HEALTHCARE CENTER,14949 MESA DR,HUMBLE,TX,77396,2819024152,610,Harris,126,...,4,4,1,36,1,0,36,2017-11-09,2016-12-13,0


In [12]:
data.columns

Index(['PROVNUM', 'PROVNAME', 'ADDRESS', 'CITY', 'STATE', 'ZIP', 'PHONE',
       'COUNTY_SSA', 'COUNTY_NAME', 'BEDCERT', 'RESTOT', 'INHOSP',
       'CCRC_FACIL', 'SFF', 'CHOW_LAST_12MOS', 'SPRINKLER_STATUS', 'EXP_TOTAL',
       'ADJ_TOTAL', 'OWNERSHIP', 'CERTIFICATION', 'CYCLE_1_DEFS',
       'CYCLE_1_NFROMDEFS', 'CYCLE_1_NFROMCOMP', 'CYCLE_1_DEFS_SCORE',
       'CYCLE_1_NUMREVIS', 'CYCLE_1_REVISIT_SCORE', 'CYCLE_1_TOTAL_SCORE',
       'CYCLE_1_SURVEY_DATE', 'CYCLE_2_SURVEY_DATE'],
      dtype='object')

In [13]:
data.assign(y=fine_totals).groupby('STATE')['y'].mean()

STATE
AK    15932.750000
AL    13672.320388
AR    17596.681592
AZ     1512.722222
CA     8054.977612
CO    22112.545918
CT     8438.121359
DC    27333.933333
DE    24899.000000
FL    16612.338211
GA    29459.975000
GU        0.000000
HI    16133.309524
IA    16565.303571
ID    42741.942029
IL     6634.197227
IN     5626.954545
KS    24420.791096
KY    32656.315385
LA     5611.819277
MA    16722.259259
MD    42806.676617
ME     1275.586957
MI    21437.397468
MN     3219.326531
MO     7635.160998
MS     7595.481283
MT    32754.970588
NC    30445.040506
ND      560.171053
NE     8377.755319
NH      260.455882
NJ     3490.756839
NM    39652.647059
NV     3050.196429
NY     2213.515260
OH     8214.822978
OK    11812.111111
OR    12365.983607
PA     9216.964687
PR    13563.333333
RI     3953.440000
SC    28205.171779
SD    13476.666667
TN    37722.919118
TX    18147.119744
UT    11496.727273
VA    19835.322835
VT    16241.147059
WA    43295.230769
WI    18425.872781
WV    52675.850467
WY    

In [14]:
dict(data.assign(y=fine_totals).groupby('STATE')['y'].mean())

{'AK': 15932.75,
 'AL': 13672.320388349515,
 'AR': 17596.6815920398,
 'AZ': 1512.7222222222222,
 'CA': 8054.977611940299,
 'CO': 22112.54591836735,
 'CT': 8438.121359223302,
 'DC': 27333.933333333334,
 'DE': 24899.0,
 'FL': 16612.338211382114,
 'GA': 29459.975,
 'GU': 0.0,
 'HI': 16133.309523809523,
 'IA': 16565.303571428572,
 'ID': 42741.942028985504,
 'IL': 6634.197226502311,
 'IN': 5626.954545454545,
 'KS': 24420.79109589041,
 'KY': 32656.315384615384,
 'LA': 5611.819277108434,
 'MA': 16722.25925925926,
 'MD': 42806.67661691542,
 'ME': 1275.5869565217392,
 'MI': 21437.39746835443,
 'MN': 3219.326530612245,
 'MO': 7635.160997732426,
 'MS': 7595.4812834224595,
 'MT': 32754.970588235294,
 'NC': 30445.040506329115,
 'ND': 560.171052631579,
 'NE': 8377.755319148937,
 'NH': 260.45588235294116,
 'NJ': 3490.756838905775,
 'NM': 39652.64705882353,
 'NV': 3050.1964285714284,
 'NY': 2213.51526032316,
 'OH': 8214.822977725675,
 'OK': 11812.111111111111,
 'OR': 12365.983606557376,
 'PA': 9216.96

In [15]:
from sklearn.base import BaseEstimator, TransformerMixin

class ToDF(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        
        return self
    
    def transform(self, X):
        if not isinstance(X, pd.DataFrame):
            return pd.DataFrame(X)
        else:
            return X


In [16]:
data_converter = ToDF()

In [17]:
dir(TransformerMixin)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'fit_transform']

In [18]:
data_converter.fit_transform(data)

Unnamed: 0,PROVNUM,PROVNAME,ADDRESS,CITY,STATE,ZIP,PHONE,COUNTY_SSA,COUNTY_NAME,BEDCERT,...,CERTIFICATION,CYCLE_1_DEFS,CYCLE_1_NFROMDEFS,CYCLE_1_NFROMCOMP,CYCLE_1_DEFS_SCORE,CYCLE_1_NUMREVIS,CYCLE_1_REVISIT_SCORE,CYCLE_1_TOTAL_SCORE,CYCLE_1_SURVEY_DATE,CYCLE_2_SURVEY_DATE
0,015010,COOSA VALLEY NURSING FACILITY,315 WEST HICKORY STREET,SYLACAUGA,AL,35150,2562495604,600,Talladega,85,...,Medicare and Medicaid,7,7,0,36,1,0,36,2017-04-06,2016-05-26
1,015012,HIGHLANDS HEALTH AND REHAB,380 WOODS COVE ROAD,SCOTTSBORO,AL,35768,2562183708,350,Jackson,50,...,Medicare and Medicaid,5,5,0,44,1,0,44,2017-03-16,2016-02-04
2,015014,EASTVIEW REHABILITATION & HEALTHCARE CENTER,7755 FOURTH AVENUE SOUTH,BIRMINGHAM,AL,35206,2058330146,360,Jefferson,92,...,Medicare and Medicaid,6,6,0,40,1,0,40,2016-10-20,2015-12-30
3,015015,PLANTATION MANOR NURSING HOME,6450 OLD TUSCALOOSA HIGHWAY P O BOX 97,MC CALLA,AL,35111,2054776161,360,Jefferson,103,...,Medicare and Medicaid,2,2,0,16,1,0,16,2017-03-09,2016-02-11
4,015016,ATHENS HEALTH AND REHABILITATION LLC,611 WEST MARKET STREET,ATHENS,AL,35611,2562321620,410,Limestone,149,...,Medicare and Medicaid,2,2,0,20,1,0,20,2017-06-01,2016-05-12
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13887,676406,TRUCARE LIVING CENTERS - SELMA,16550 RETAMA PARKWAY,SELMA,TX,78154,2108868393,581,Guadalupe,128,...,Medicare and Medicaid,13,12,1,88,1,0,88,2017-08-10,2016-10-05
13888,676408,THE LODGE AT BEAR CREEK,3729 IRA E WOODS AVENUE,GRAPEVINE,TX,76051,8178098000,910,Tarrant,100,...,Medicare and Medicaid,2,1,1,20,1,0,20,2017-06-22,2016-07-21
13889,676411,CLARENDON NURSING HOME,TEN MEDICAL CENTER DR,CLARENDON,TX,79226,8068745221,431,Donley,61,...,Medicare and Medicaid,15,11,7,116,1,0,116,2017-09-28,2016-12-14
13890,676412,FALL CREEK REHABILITATION AND HEALTHCARE CENTER,14949 MESA DR,HUMBLE,TX,77396,2819024152,610,Harris,126,...,Medicare and Medicaid,4,4,1,36,1,0,36,2017-11-09,2016-12-13


In [19]:
state_to_average = {'AL': 5, 'TX': 1}
data['STATE'].apply(lambda x: state_to_average.get(x))

0        5.0
1        5.0
2        5.0
3        5.0
4        5.0
        ... 
13887    1.0
13888    1.0
13889    1.0
13890    1.0
13891    1.0
Name: STATE, Length: 13892, dtype: float64

In [20]:
from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin

class GroupMeanRegressor(BaseEstimator, RegressorMixin):
    def __init__(self, group):
        self.group = group

    def fit(self, X, y):
        # Group the X passed to fit (not the global `data`) and average y per group.
        self.group_average_ = dict(X.assign(y=y).groupby(self.group)['y'].mean())
        # The overall mean is the fallback for groups not seen during fitting.
        self.mean_ = y.mean()
        return self

    def predict(self, X):
        return X[self.group].apply(lambda x: self.group_average_.get(x, self.mean_))


In [21]:
predictor=GroupMeanRegressor('STATE')
predictor.fit(data, fine_totals)

GroupMeanRegressor(group='STATE')

In [22]:
predictor.predict(data)

0        13672.320388
1        13672.320388
2        13672.320388
3        13672.320388
4        13672.320388
             ...     
13887    18147.119744
13888    18147.119744
13889    18147.119744
13890    18147.119744
13891    18147.119744
Name: STATE, Length: 13892, dtype: float64

After filling in the class definition, we can create an instance of the estimator and fit it to the data.

In [23]:
from sklearn.pipeline import Pipeline

state_model = Pipeline([
    ('converter', ToDF()),
    ('regressor', GroupMeanRegressor('STATE'))
])
state_model.fit(data, fine_totals)

Pipeline(steps=[('converter', ToDF()),
                ('regressor', GroupMeanRegressor(group='STATE'))])

Next we should test that our predict method works.

In [24]:
state_model.predict(data.sample(5))

8386     30445.040506
7372      3490.756839
743       8054.977612
13617    18147.119744
9813     11812.111111
Name: STATE, dtype: float64

However, what if we have data from a nursing home in a state (or territory) of the US which is not in the training data?

In [25]:
state_model.predict(pd.DataFrame([{'STATE': 'AS'}]))

0    14969.857688
Name: STATE, dtype: float64

In [26]:
fine_totals_avg = fine_totals.mean()

In [27]:
grader.score.ml__state_model(lambda x: len(x) * [fine_totals_avg])

Your score: 0.019


Make sure your model can handle this possibility before submitting your model's predict method to the grader.
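A minimal sketch of one fallback strategy: look up each state's mean and substitute the overall mean when the state was never seen. The toy numbers are illustrative, and `Series.map` followed by `fillna` is just one of several ways to express this.

```python
import pandas as pd

# Toy training data: two states only.
train = pd.DataFrame({'STATE': ['AL', 'AL', 'TX'],
                      'FINE_TOT': [10.0, 30.0, 50.0]})

state_means = train.groupby('STATE')['FINE_TOT'].mean()
overall_mean = train['FINE_TOT'].mean()

# 'AS' does not appear in the training data.
new_rows = pd.DataFrame([{'STATE': 'TX'}, {'STATE': 'AS'}])

# map() yields NaN for unknown states; fillna() replaces it with the overall mean.
preds = new_rows['STATE'].map(state_means).fillna(overall_mean)
print(preds.tolist())
```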

In [28]:
grader.score.ml__state_model(state_model.predict)

Your score: 1.000


## Question 2: simple_features_model

Nursing homes vary greatly in their business characteristics. Some are owned by the government or non-profits while others are run for profit. Some house a few dozen residents while others house hundreds. Some are located within hospitals and may work with more vulnerable populations. We will try to predict which facilities are fined based on their business characteristics.

We'll begin with columns in our DataFrame containing numeric and boolean features. Some of the rows contain null values; estimators cannot handle null values so these must be imputed or dropped. We will create a `Pipeline` containing transformers that process these features, followed by an estimator.
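As a quick illustration of mean imputation, the sketch below runs scikit-learn's `SimpleImputer` on a tiny array with two missing entries. The array values are made up for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Each column has one missing value.
X_toy = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, np.nan]])

imputer_demo = SimpleImputer(strategy='mean')

# fit learns the column means (2.0 and 15.0); transform fills the gaps with them.
X_filled = imputer_demo.fit_transform(X_toy)
print(X_filled)
```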

**Note:** As mentioned above in Question 1, when the grader checks your answer, it passes a list of dictionaries to the `predict` or `predict_proba` method of your estimator, not a DataFrame. This means that your model must work with both data types. For this reason, we've provided a custom `ColumnSelectTransformer` for you to use instead of `scikit-learn`'s own `ColumnTransformer`.

In [29]:
data.head()

Unnamed: 0,PROVNUM,PROVNAME,ADDRESS,CITY,STATE,ZIP,PHONE,COUNTY_SSA,COUNTY_NAME,BEDCERT,...,CERTIFICATION,CYCLE_1_DEFS,CYCLE_1_NFROMDEFS,CYCLE_1_NFROMCOMP,CYCLE_1_DEFS_SCORE,CYCLE_1_NUMREVIS,CYCLE_1_REVISIT_SCORE,CYCLE_1_TOTAL_SCORE,CYCLE_1_SURVEY_DATE,CYCLE_2_SURVEY_DATE
0,15010,COOSA VALLEY NURSING FACILITY,315 WEST HICKORY STREET,SYLACAUGA,AL,35150,2562495604,600,Talladega,85,...,Medicare and Medicaid,7,7,0,36,1,0,36,2017-04-06,2016-05-26
1,15012,HIGHLANDS HEALTH AND REHAB,380 WOODS COVE ROAD,SCOTTSBORO,AL,35768,2562183708,350,Jackson,50,...,Medicare and Medicaid,5,5,0,44,1,0,44,2017-03-16,2016-02-04
2,15014,EASTVIEW REHABILITATION & HEALTHCARE CENTER,7755 FOURTH AVENUE SOUTH,BIRMINGHAM,AL,35206,2058330146,360,Jefferson,92,...,Medicare and Medicaid,6,6,0,40,1,0,40,2016-10-20,2015-12-30
3,15015,PLANTATION MANOR NURSING HOME,6450 OLD TUSCALOOSA HIGHWAY P O BOX 97,MC CALLA,AL,35111,2054776161,360,Jefferson,103,...,Medicare and Medicaid,2,2,0,16,1,0,16,2017-03-09,2016-02-11
4,15016,ATHENS HEALTH AND REHABILITATION LLC,611 WEST MARKET STREET,ATHENS,AL,35611,2562321620,410,Limestone,149,...,Medicare and Medicaid,2,2,0,20,1,0,20,2017-06-01,2016-05-12


In [30]:
data.columns

Index(['PROVNUM', 'PROVNAME', 'ADDRESS', 'CITY', 'STATE', 'ZIP', 'PHONE',
       'COUNTY_SSA', 'COUNTY_NAME', 'BEDCERT', 'RESTOT', 'INHOSP',
       'CCRC_FACIL', 'SFF', 'CHOW_LAST_12MOS', 'SPRINKLER_STATUS', 'EXP_TOTAL',
       'ADJ_TOTAL', 'OWNERSHIP', 'CERTIFICATION', 'CYCLE_1_DEFS',
       'CYCLE_1_NFROMDEFS', 'CYCLE_1_NFROMCOMP', 'CYCLE_1_DEFS_SCORE',
       'CYCLE_1_NUMREVIS', 'CYCLE_1_REVISIT_SCORE', 'CYCLE_1_TOTAL_SCORE',
       'CYCLE_1_SURVEY_DATE', 'CYCLE_2_SURVEY_DATE'],
      dtype='object')

In [31]:
data.columns[data.isnull().any()]

Index(['RESTOT', 'EXP_TOTAL', 'ADJ_TOTAL'], dtype='object')

In [32]:
class MyImputer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Store the mean of every column that has missing values in the training data.
        columns_missing = X.columns[X.isnull().any()]
        self.means_ = X[columns_missing].mean()
        return self

    def transform(self, X):
        # fillna with a Series fills each column with its matching stored mean.
        return X.copy().fillna(self.means_)

In [33]:
imputer=MyImputer()
X_t=imputer.fit_transform(data)

In [34]:
X_t

Unnamed: 0,PROVNUM,PROVNAME,ADDRESS,CITY,STATE,ZIP,PHONE,COUNTY_SSA,COUNTY_NAME,BEDCERT,...,CERTIFICATION,CYCLE_1_DEFS,CYCLE_1_NFROMDEFS,CYCLE_1_NFROMCOMP,CYCLE_1_DEFS_SCORE,CYCLE_1_NUMREVIS,CYCLE_1_REVISIT_SCORE,CYCLE_1_TOTAL_SCORE,CYCLE_1_SURVEY_DATE,CYCLE_2_SURVEY_DATE
0,015010,COOSA VALLEY NURSING FACILITY,315 WEST HICKORY STREET,SYLACAUGA,AL,35150,2562495604,600,Talladega,85,...,Medicare and Medicaid,7,7,0,36,1,0,36,2017-04-06,2016-05-26
1,015012,HIGHLANDS HEALTH AND REHAB,380 WOODS COVE ROAD,SCOTTSBORO,AL,35768,2562183708,350,Jackson,50,...,Medicare and Medicaid,5,5,0,44,1,0,44,2017-03-16,2016-02-04
2,015014,EASTVIEW REHABILITATION & HEALTHCARE CENTER,7755 FOURTH AVENUE SOUTH,BIRMINGHAM,AL,35206,2058330146,360,Jefferson,92,...,Medicare and Medicaid,6,6,0,40,1,0,40,2016-10-20,2015-12-30
3,015015,PLANTATION MANOR NURSING HOME,6450 OLD TUSCALOOSA HIGHWAY P O BOX 97,MC CALLA,AL,35111,2054776161,360,Jefferson,103,...,Medicare and Medicaid,2,2,0,16,1,0,16,2017-03-09,2016-02-11
4,015016,ATHENS HEALTH AND REHABILITATION LLC,611 WEST MARKET STREET,ATHENS,AL,35611,2562321620,410,Limestone,149,...,Medicare and Medicaid,2,2,0,20,1,0,20,2017-06-01,2016-05-12
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13887,676406,TRUCARE LIVING CENTERS - SELMA,16550 RETAMA PARKWAY,SELMA,TX,78154,2108868393,581,Guadalupe,128,...,Medicare and Medicaid,13,12,1,88,1,0,88,2017-08-10,2016-10-05
13888,676408,THE LODGE AT BEAR CREEK,3729 IRA E WOODS AVENUE,GRAPEVINE,TX,76051,8178098000,910,Tarrant,100,...,Medicare and Medicaid,2,1,1,20,1,0,20,2017-06-22,2016-07-21
13889,676411,CLARENDON NURSING HOME,TEN MEDICAL CENTER DR,CLARENDON,TX,79226,8068745221,431,Donley,61,...,Medicare and Medicaid,15,11,7,116,1,0,116,2017-09-28,2016-12-14
13890,676412,FALL CREEK REHABILITATION AND HEALTHCARE CENTER,14949 MESA DR,HUMBLE,TX,77396,2819024152,610,Harris,126,...,Medicare and Medicaid,4,4,1,36,1,0,36,2017-11-09,2016-12-13


In [35]:
import time
t_0 = time.time()
imputer=MyImputer()
X_t=imputer.fit_transform(data)
t_elapsed=time.time()-t_0

print(f" Elapsed time: {t_elapsed : g}")

 Elapsed time:  0.0140166


In [36]:
from sklearn.impute import SimpleImputer

In [37]:

simple_cols = ['BEDCERT', 'RESTOT', 'INHOSP', 'CCRC_FACIL', 'SFF', 'CHOW_LAST_12MOS', 'SPRINKLER_STATUS', 'EXP_TOTAL', 'ADJ_TOTAL']

class ColumnSelectTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]
        
simple_features = Pipeline([
    ('converter', ToDF()),
    ('selector', ColumnSelectTransformer(simple_cols)),
    ('imputer', SimpleImputer(strategy='mean'))
])

**Note:** The assertion below assumes the output of `simple_features.fit_transform` is an `ndarray`, not a `DataFrame`.

In [38]:
assert data['RESTOT'].isnull().sum() > 0
assert not np.isnan(simple_features.fit_transform(data)).any()

Now combine the `simple_features` pipeline with an estimator in a new pipeline. Fit `simple_features_model` to the data and submit `simple_features_model.predict_proba` to the grader. You may wish to use cross-validation to tune the hyperparameters of your model.
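A hedged sketch of cross-validated tuning, run on synthetic data rather than the provider data. The grid values, the `StandardScaler` step, and the choice of `scoring='average_precision'` (picked to match the grader's metric) are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data as a stand-in.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

demo_model = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
demo_grid = {'classifier__C': [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(demo_model, demo_grid, cv=5, scoring='average_precision')
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search` behaves like the best estimator, so `search.predict_proba` can be used in the same way as a plain pipeline's method.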

In [39]:
from sklearn.linear_model import LogisticRegression

In [40]:
simple_features_model = Pipeline([
    ('simple', simple_features),
    ('classifier', LogisticRegression())
    # add your estimator here
])

In [41]:
simple_features_model.fit(data, fine_counts > 0)

Pipeline(steps=[('simple',
                 Pipeline(steps=[('converter', ToDF()),
                                 ('selector',
                                  ColumnSelectTransformer(columns=['BEDCERT',
                                                                   'RESTOT',
                                                                   'INHOSP',
                                                                   'CCRC_FACIL',
                                                                   'SFF',
                                                                   'CHOW_LAST_12MOS',
                                                                   'SPRINKLER_STATUS',
                                                                   'EXP_TOTAL',
                                                                   'ADJ_TOTAL'])),
                                 ('imputer', SimpleImputer())])),
                ('classifier', LogisticRegression())])

In [42]:
def positive_probability(model):
    def predict_proba(X):
        return model.predict_proba(X)[:, 1]
    return predict_proba

grader.score.ml__simple_features(positive_probability(simple_features_model))

Your score: 0.996


## Question 3: categorical_features

The `'OWNERSHIP'` and `'CERTIFICATION'` columns contain categorical data. We will have to encode the categorical data into numerical features before we pass them to an estimator. Construct one or more pipelines for this purpose. Transformers such as [LabelEncoder](https://scikit-learn.org/0.19/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder) and [OneHotEncoder](https://scikit-learn.org/0.19/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) may be useful, but you may also want to define your own transformers.

If you used more than one `Pipeline`, combine them with a `FeatureUnion`. As in Question 2, we will combine this with an estimator, fit it, and submit the `predict_proba` method to the grader.
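The sketch below shows one way two one-hot pipelines could be joined with a `FeatureUnion`. It is self-contained, so it uses a hypothetical `DemoColumnSelector` as a stand-in for `ColumnSelectTransformer`, and the category strings in the toy frame are illustrative.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder

class DemoColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for ColumnSelectTransformer; accepts dict lists too."""
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return pd.DataFrame(X)[self.columns]

# Toy data with two categories per column.
toy = pd.DataFrame({
    'OWNERSHIP': ['For profit', 'Non profit', 'For profit'],
    'CERTIFICATION': ['Medicare', 'Medicare and Medicaid', 'Medicare and Medicaid'],
})

def one_hot_branch(column):
    """Build a select-then-encode pipeline for a single categorical column."""
    return Pipeline([
        ('select', DemoColumnSelector([column])),
        # handle_unknown='ignore' encodes unseen categories as all zeros.
        ('encode', OneHotEncoder(handle_unknown='ignore')),
    ])

demo_union = FeatureUnion([
    ('owner', one_hot_branch('OWNERSHIP')),
    ('cert', one_hot_branch('CERTIFICATION')),
])

# The union of sparse encoder outputs is sparse; densify it for inspection.
encoded = demo_union.fit_transform(toy).toarray()
print(encoded.shape)
```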

In [43]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13892 entries, 0 to 13891
Data columns (total 29 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   PROVNUM                13892 non-null  object 
 1   PROVNAME               13892 non-null  object 
 2   ADDRESS                13892 non-null  object 
 3   CITY                   13892 non-null  object 
 4   STATE                  13892 non-null  object 
 5   ZIP                    13892 non-null  int64  
 6   PHONE                  13892 non-null  int64  
 7   COUNTY_SSA             13892 non-null  int64  
 8   COUNTY_NAME            13892 non-null  object 
 9   BEDCERT                13892 non-null  int64  
 10  RESTOT                 13483 non-null  float64
 11  INHOSP                 13892 non-null  bool   
 12  CCRC_FACIL             13892 non-null  bool   
 13  SFF                    13892 non-null  bool   
 14  CHOW_LAST_12MOS        13892 non-null  bool   
 15  SP

In [44]:
from sklearn.preprocessing import OneHotEncoder

In [49]:
categorical_features = Pipeline([
    ('converter', ToDF()),
    ('selector', ColumnSelectTransformer(['OWNERSHIP', 'CERTIFICATION'])),
    ('encoder', OneHotEncoder())

])


In [50]:
X_t=categorical_features.fit_transform(data)

In [47]:
from sklearn.pipeline import FeatureUnion

owner_onehot = Pipeline([
    ('cst', ColumnSelectTransformer(['OWNERSHIP'])),
])

cert_onehot = Pipeline([
    ('cst', ColumnSelectTransformer(['CERTIFICATION'])),
])

categorical_features = FeatureUnion([
])

ValueError: not enough values to unpack (expected 2, got 0)

In [48]:
features_t = categorical_features.fit_transform(data)
assert features_t.shape[0] == data.shape[0]
assert features_t.dtype == np.float64
# OneHotEncoder returns a sparse matrix by default; densify it before calling np.isnan.
dense_t = features_t.toarray() if hasattr(features_t, 'toarray') else features_t
assert not np.isnan(dense_t).any()

As in the previous question, create a model using the `categorical_features`, fit it to the data, and submit its `predict_proba` method to the grader.

In [51]:
categorical_features_model = Pipeline([
    ('categorical', categorical_features),
    ('classifier', LogisticRegression())
    # add your estimator here
])

In [52]:
categorical_features_model.fit(data, fine_counts > 0)

Pipeline(steps=[('categorical',
                 Pipeline(steps=[('converter', ToDF()),
                                 ('selector',
                                  ColumnSelectTransformer(columns=['OWNERSHIP',
                                                                   'CERTIFICATION'])),
                                 ('encoder', OneHotEncoder())])),
                ('classifier', LogisticRegression())])

In [53]:
grader.score.ml__categorical_features(positive_probability(categorical_features_model))

Your score: 0.937


In [54]:
#trying to optimise

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {'max_depth': range(2, 20)}
grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5, verbose=1)

categorical_features_model = Pipeline([
    ('categorical', categorical_features),
    ('classifier', grid_search)
   
])

In [55]:
categorical_features_model.fit(data, fine_counts > 0)

Fitting 5 folds for each of 18 candidates, totalling 90 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  90 out of  90 | elapsed:    0.5s finished


Pipeline(steps=[('categorical',
                 Pipeline(steps=[('converter', ToDF()),
                                 ('selector',
                                  ColumnSelectTransformer(columns=['OWNERSHIP',
                                                                   'CERTIFICATION'])),
                                 ('encoder', OneHotEncoder())])),
                ('classifier',
                 GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
                              param_grid={'max_depth': range(2, 20)},
                              verbose=1))])

In [56]:
grader.score.ml__categorical_features(positive_probability(categorical_features_model))

Your score: 1.047


## Question 4: business_model

Finally, we'll combine `simple_features` and `categorical_features` in a `FeatureUnion`, followed by an estimator in a `Pipeline`. You may want to optimize the hyperparameters of your estimator using cross-validation or try engineering new features (e.g. see [PolynomialFeatures](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)). When you've assembled and trained your model, pass the `predict_proba` method to the grader.
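To show what `PolynomialFeatures` adds, here is a small sketch on a two-row toy array. With three input columns and `degree=2`, it keeps the originals and appends all squares and pairwise products; `include_bias=False` is an arbitrary choice here that drops the constant column.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_toy = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

poly = PolynomialFeatures(degree=2, include_bias=False)

# Output columns: x0, x1, x2, x0^2, x0*x1, x0*x2, x1^2, x1*x2, x2^2
X_poly = poly.fit_transform(X_toy)
print(X_poly.shape)
print(X_poly[0])
```

In a model, a step like this would typically sit between the feature union and the estimator, and the number of columns grows quickly with the degree.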

In [57]:
from sklearn.pipeline import FeatureUnion

business_features = FeatureUnion([
    ('simple', simple_features),
    ('categorical', categorical_features)
])

In [58]:
from sklearn.linear_model import LogisticRegression
business_model = Pipeline([
    ('features', business_features),
    ('classifier', LogisticRegression(max_iter=500))
    # add your estimator here
])

In [59]:
business_model.fit(data, fine_counts > 0)

Pipeline(steps=[('features',
                 FeatureUnion(transformer_list=[('simple',
                                                 Pipeline(steps=[('converter',
                                                                  ToDF()),
                                                                 ('selector',
                                                                  ColumnSelectTransformer(columns=['BEDCERT',
                                                                                                   'RESTOT',
                                                                                                   'INHOSP',
                                                                                                   'CCRC_FACIL',
                                                                                                   'SFF',
                                                                                                   'CHOW_LAST_12MOS',
               

In [60]:
grader.score.ml__business_model(positive_probability(business_model))

Your score: 0.923


In [None]:
#trying to optimise

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {'max_depth': range(2, 20)}
grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5, verbose=1)

# Use the combined business_features here, not only the categorical features.
business_model = Pipeline([
    ('features', business_features),
    ('classifier', grid_search)
])

In [None]:
business_model.fit(data, fine_counts > 0)

In [None]:
grader.score.ml__business_model(positive_probability(business_model))

## Question 5: survey_results

Surveys reveal safety and health deficiencies at nursing homes that may indicate risk for incidents (and penalties). CMS routinely surveys nursing homes. Build a model that combines the `business_features` of each facility with its cycle 1 survey results and the time between the cycle 1 and cycle 2 surveys to predict the cycle 2 total score.

First, let's create a transformer to calculate the difference in time between the cycle 1 and cycle 2 surveys.
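The underlying date arithmetic can be sketched on two rows copied from the first two facilities in the data. Subtracting two datetime Series gives a timedelta Series, which can be read out in days or in seconds through the `.dt` accessor.

```python
import pandas as pd

# Survey dates of the first two facilities shown in data.head().
toy = pd.DataFrame({
    'CYCLE_1_SURVEY_DATE': ['2017-04-06', '2017-03-16'],
    'CYCLE_2_SURVEY_DATE': ['2016-05-26', '2016-02-04'],
})

# Parse the strings, then subtract to get a timedelta Series.
delta = (pd.to_datetime(toy['CYCLE_1_SURVEY_DATE'], format='%Y-%m-%d')
         - pd.to_datetime(toy['CYCLE_2_SURVEY_DATE'], format='%Y-%m-%d'))

days_between = delta.dt.days
seconds_between = delta.dt.total_seconds()
print(days_between.tolist())
print(seconds_between.tolist())
```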

In [61]:
data.columns

Index(['PROVNUM', 'PROVNAME', 'ADDRESS', 'CITY', 'STATE', 'ZIP', 'PHONE',
       'COUNTY_SSA', 'COUNTY_NAME', 'BEDCERT', 'RESTOT', 'INHOSP',
       'CCRC_FACIL', 'SFF', 'CHOW_LAST_12MOS', 'SPRINKLER_STATUS', 'EXP_TOTAL',
       'ADJ_TOTAL', 'OWNERSHIP', 'CERTIFICATION', 'CYCLE_1_DEFS',
       'CYCLE_1_NFROMDEFS', 'CYCLE_1_NFROMCOMP', 'CYCLE_1_DEFS_SCORE',
       'CYCLE_1_NUMREVIS', 'CYCLE_1_REVISIT_SCORE', 'CYCLE_1_TOTAL_SCORE',
       'CYCLE_1_SURVEY_DATE', 'CYCLE_2_SURVEY_DATE'],
      dtype='object')

In [62]:
data['CYCLE_2_SURVEY_DATE']

0        2016-05-26
1        2016-02-04
2        2015-12-30
3        2016-02-11
4        2016-05-12
            ...    
13887    2016-10-05
13888    2016-07-21
13889    2016-12-14
13890    2016-12-13
13891    2017-02-17
Name: CYCLE_2_SURVEY_DATE, Length: 13892, dtype: object

In [63]:
(pd.to_datetime(data['CYCLE_1_SURVEY_DATE'])-pd.to_datetime(data['CYCLE_2_SURVEY_DATE'])).dt.total_seconds()

0        27216000.0
1        35078400.0
2        25488000.0
3        33868800.0
4        33264000.0
            ...    
13887    26697600.0
13888    29030400.0
13889    24883200.0
13890    28598400.0
13891     8467200.0
Length: 13892, dtype: float64

In [64]:
# First draft, superseded by the class in the next cell. It returned a 1-D Series.
# class TimedeltaTransformer(BaseEstimator, TransformerMixin):
#     def __init__(self, t1_col, t2_col):
#         self.t1_col = t1_col
#         self.t2_col = t2_col
#
#     def fit(self, X, y=None):
#         return self
#
#     def transform(self, X):
#         return (pd.to_datetime(X[self.t1_col]) - pd.to_datetime(X[self.t2_col])).dt.total_seconds()

In [93]:
class TimedeltaTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, t1_col, t2_col, fmt=None):
        self.t1_col = t1_col
        self.t2_col = t2_col
        self.fmt = fmt
        
    def _to_datetime(self, x):
        return pd.to_datetime(x, format=self.fmt)

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        t_1 = self._to_datetime(X[self.t1_col])
        t_2 = self._to_datetime(X[self.t2_col])

        # Return a 2-D column of shape (n_samples, 1) so it can be stacked in a FeatureUnion.
        return (t_1 - t_2).dt.total_seconds().values.reshape(-1, 1)

In [94]:
cycle_1_date = 'CYCLE_1_SURVEY_DATE'
cycle_2_date = 'CYCLE_2_SURVEY_DATE'
time_feature = TimedeltaTransformer(cycle_1_date, cycle_2_date, fmt='%Y-%m-%d')

In [99]:
time_feature.fit_transform(data)

array([[27216000.],
       [35078400.],
       [25488000.],
       ...,
       [24883200.],
       [28598400.],
       [ 8467200.]])

In the cell below we'll collect the cycle 1 survey features.

In [96]:
data[cycle_1_date]

0        2017-04-06
1        2017-03-16
2        2016-10-20
3        2017-03-09
4        2017-06-01
            ...    
13887    2017-08-10
13888    2017-06-22
13889    2017-09-28
13890    2017-11-09
13891    2017-05-26
Name: CYCLE_1_SURVEY_DATE, Length: 13892, dtype: object

In [72]:
cycle_1_cols = ['CYCLE_1_DEFS', 'CYCLE_1_NFROMDEFS', 'CYCLE_1_NFROMCOMP',
                'CYCLE_1_DEFS_SCORE', 'CYCLE_1_NUMREVIS',
                'CYCLE_1_REVISIT_SCORE', 'CYCLE_1_TOTAL_SCORE']
cycle_1_features = ColumnSelectTransformer(cycle_1_cols)

In [105]:
simple_features = Pipeline([
    ('selector', ColumnSelectTransformer(simple_cols)),
    ('imputer', SimpleImputer(strategy='mean'))
])

In [106]:
categorical_features = Pipeline([
    ('selector', ColumnSelectTransformer(['OWNERSHIP', 'CERTIFICATION'])),
    ('encoder', OneHotEncoder())

])

In [109]:
#from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
union = FeatureUnion([
        ('business', business_features),
        ('survey', cycle_1_features),
        ('time', time_feature)
])

survey_model = Pipeline([
    ('converter', ToDF()),
    ('features', union),
    ('regressor', LinearRegression())
    # add your estimator here
])

 

In [110]:
from sklearn import set_config
set_config(display='diagram')

In [111]:
survey_model

In [112]:
union.fit_transform(data, cycle_2_score.astype(int))

<13892x33 sparse matrix of type '<class 'numpy.float64'>'
	with 184633 stored elements in Compressed Sparse Row format>

In [113]:
survey_model.fit(data, cycle_2_score.astype(int))

In [114]:
grader.score.ml__survey_model(survey_model.predict)

Your score: 1.131


*Copyright &copy; 2021 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.*