# Stage 2: Improved Data Preprocessing and Modeling with SageMaker

Goal: predict whether water is acceptable for human consumption (potability) from its chemical characteristics.

Data source: [Kaggle Water Quality Dataset](https://www.kaggle.com/adityakadiwal/water-potability)

In [1]:
import numpy as np
import pandas as pd

In [2]:
bucket = '<INSERT BUCKET NAME HERE>'
input_prefix = 'input'
input_key = 'water_potability.csv'
container_uri = '<INSERT CUSTOM CONTAINER URI HERE>'

region = 'us-west-2'

In [3]:
s3_input_uri = f's3://{bucket}/{input_prefix}/{input_key}'

In [4]:
df = pd.read_csv(s3_input_uri)
df.head()

| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.99097 | 2.963135 | 0 |
| 1 | 3.71608 | 129.422921 | 18630.057858 | 6.635246 | NaN | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | NaN | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.5466 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |


In [5]:
cols = df.columns

In [6]:
df.isnull().sum()

ph                 491
Hardness             0
Solids               0
Chloramines          0
Sulfate            781
Conductivity         0
Organic_carbon       0
Trihalomethanes    162
Turbidity            0
Potability           0
dtype: int64

In [7]:
df.shape

(3276, 10)

In [8]:
df.Potability.value_counts()

0    1998
1    1278
Name: Potability, dtype: int64

In [9]:
df_notpotable = df[df['Potability'] == 0]
df_potable = df[df['Potability'] == 1]

In [10]:
df_notpotable.isnull().sum()

ph                 314
Hardness             0
Solids               0
Chloramines          0
Sulfate            488
Conductivity         0
Organic_carbon       0
Trihalomethanes    107
Turbidity            0
Potability           0
dtype: int64

In [11]:
df_potable.isnull().sum()

ph                 177
Hardness             0
Solids               0
Chloramines          0
Sulfate            293
Conductivity         0
Organic_carbon       0
Trihalomethanes     55
Turbidity            0
Potability           0
dtype: int64

<h3>Imputing</h3>
Missing values are filled with the column mean, computed separately for each class. This transformation does not need to be part of the pipeline deployed to the inference endpoint, because data arriving at the endpoint is not expected to contain missing values.

In [12]:
from sklearn.impute import SimpleImputer

impute = SimpleImputer(missing_values=np.nan, strategy='mean')

df_notpotable = pd.DataFrame(impute.fit_transform(df_notpotable), columns = cols)

df_potable = pd.DataFrame(impute.fit_transform(df_potable), columns = cols)
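
As a rough illustration of what the imputation cell above computes, the sketch below fills missing values with the column mean of each class separately, using plain Python on a made-up toy table. The helper name `impute_by_class` and the sample values are assumptions for illustration only; the notebook itself relies on `SimpleImputer`. The sketch also assumes every column has at least one observed value per class.

```python
from statistics import mean

def impute_by_class(rows, label_key):
    """Fill None values with the per-class column mean (hypothetical helper)."""
    # Group the rows by their class label, keeping first-seen order.
    groups = {}
    for row in rows:
        groups.setdefault(row[label_key], []).append(row)

    filled = []
    for members in groups.values():
        features = [k for k in members[0] if k != label_key]
        # Mean of each feature over the observed (non-missing) values of this class.
        means = {
            k: mean(r[k] for r in members if r[k] is not None)
            for k in features
        }
        for r in members:
            new_row = dict(r)
            for k in features:
                if new_row[k] is None:
                    new_row[k] = means[k]
            filled.append(new_row)
    return filled

# Made-up toy rows; None marks a missing measurement.
toy = [
    {"ph": 6.0, "Sulfate": None, "Potability": 0},
    {"ph": None, "Sulfate": 300.0, "Potability": 0},
    {"ph": 8.0, "Sulfate": 340.0, "Potability": 0},
    {"ph": None, "Sulfate": 350.0, "Potability": 1},
    {"ph": 7.5, "Sulfate": 330.0, "Potability": 1},
]
result = impute_by_class(toy, "Potability")
print(result[0]["Sulfate"], result[1]["ph"], result[3]["ph"])
```

Because the fill value depends on the label, a step like this can only run on labeled training data, which is consistent with keeping it out of the deployed pipeline.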

In [13]:
df_notpotable.isnull().sum()

ph                 0
Hardness           0
Solids             0
Chloramines        0
Sulfate            0
Conductivity       0
Organic_carbon     0
Trihalomethanes    0
Turbidity          0
Potability         0
dtype: int64

In [14]:
df_potable.isnull().sum()

ph                 0
Hardness           0
Solids             0
Chloramines        0
Sulfate            0
Conductivity       0
Organic_carbon     0
Trihalomethanes    0
Turbidity          0
Potability         0
dtype: int64

In [15]:
df = pd.concat([df_notpotable, df_potable])

In [16]:
df.head()

| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.085378 | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.99097 | 2.963135 | 0.0 |
| 1 | 3.71608 | 129.422921 | 18630.057858 | 6.635246 | 334.56429 | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0.0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | 334.56429 | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0.0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0.0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.5466 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0.0 |


In [17]:
df = df.sample(frac = 1)

In [18]:
df.head()

| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 847 | 5.429335 | 183.439383 | 15265.407564 | 5.714731 | 394.001195 | 446.879149 | 17.581557 | 50.266951 | 3.081736 | 1.0 |
| 1667 | 7.085378 | 217.944979 | 37820.047327 | 8.299339 | 334.56429 | 367.570082 | 15.421034 | 36.446614 | 2.99478 | 0.0 |
| 1866 | 7.085378 | 188.445469 | 28791.614416 | 8.040356 | 382.009477 | 422.234861 | 10.57569 | 63.235365 | 3.228379 | 0.0 |
| 1244 | 5.91054 | 241.140746 | 25721.833866 | 4.806759 | 385.887468 | 462.61253 | 14.316821 | 60.590359 | 4.007508 | 1.0 |
| 406 | 6.361667 | 175.043999 | 25833.851713 | 8.243781 | 333.947107 | 302.19071 | 10.558576 | 70.107693 | 3.681765 | 0.0 |


In [19]:
x = df.drop('Potability', axis = 1)
y = df['Potability']

<h3>Scaling (normalization)</h3>

In [20]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(x)

x = scaler.transform(x)

x = pd.DataFrame(x)
x

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.387810 | 0.493331 | 0.245368 | 0.419783 | 0.752779 | 0.464092 | 0.589332 | 0.401818 | 0.308515 |
| 1 | 0.506098 | 0.618491 | 0.615686 | 0.622101 | 0.583939 | 0.325406 | 0.506553 | 0.289697 | 0.292074 |
| 2 | 0.506098 | 0.511489 | 0.467451 | 0.601828 | 0.718714 | 0.420997 | 0.320908 | 0.507029 | 0.336241 |
| 3 | 0.422181 | 0.702627 | 0.417049 | 0.348709 | 0.729730 | 0.491605 | 0.464246 | 0.485570 | 0.483552 |
| 4 | 0.454405 | 0.462879 | 0.418888 | 0.617752 | 0.582185 | 0.211078 | 0.320252 | 0.562782 | 0.421964 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3271 | 0.671132 | 0.416137 | 0.146205 | 0.620198 | 0.583939 | 0.472216 | 0.399721 | 0.645064 | 0.393363 |
| 3272 | 0.483578 | 0.559388 | 0.419771 | 0.705717 | 0.564211 | 0.590874 | 0.594648 | 0.724691 | 0.709673 |
| 3273 | 0.553152 | 0.501051 | 0.496922 | 0.601860 | 0.578265 | 0.232474 | 0.572121 | 0.479739 | 0.710471 |
| 3274 | 0.421167 | 0.476061 | 0.280745 | 0.386705 | 0.583939 | 0.510970 | 0.545506 | 0.638747 | 0.677416 |
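
The scaling step above maps each feature to the range [0, 1] using (x - min) / (max - min), with the minimum and maximum taken from the data the scaler was fitted on. The sketch below reproduces that formula in plain Python. The function names and sample values are assumptions for illustration, not part of the notebook, and the zero-range fallback is a choice made for this sketch.

```python
def fit_min_max(column):
    """Return the (min, max) pair learned from a training column."""
    return min(column), max(column)

def transform_min_max(column, lo, hi):
    """Scale values with (x - min) / (max - min)."""
    span = hi - lo
    # A constant column has zero range; map it to 0.0 to avoid dividing by zero.
    return [(v - lo) / span if span else 0.0 for v in column]

train_ph = [3.0, 7.0, 9.0, 11.0]  # made-up training values
lo, hi = fit_min_max(train_ph)
scaled_train = transform_min_max(train_ph, lo, hi)
print(scaled_train)  # [0.0, 0.5, 0.75, 1.0]

# New data reuses the statistics learned from the training set.
scaled_new = transform_min_max([5.0], lo, hi)
print(scaled_new)  # [0.25]
```

Since new inputs have to be scaled with the same fitted minimum and maximum, scaling is presumably the kind of step that does belong in the preprocessing script deployed with the model.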


<h3>Oversampling</h3>
Oversampling also does not need to be part of the pipeline at the endpoint, because its only purpose is to generate additional training examples.

In [22]:
!pip install -U imbalanced-learn

Collecting imbalanced-learn
  Downloading imbalanced_learn-0.8.0-py3-none-any.whl (206 kB)
Installing collected packages: imbalanced-learn
Successfully installed imbalanced-learn-0.8.0
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.


In [23]:
from imblearn.over_sampling import SMOTE

oversample = SMOTE()
x_res, y_res = oversample.fit_resample(x,y)

In [24]:
x_res.shape, y_res.shape

((3996, 9), (3996,))
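
The resampled shape is consistent with the class counts seen earlier: the minority class is topped up from 1278 to 1998 rows, giving 2 × 1998 = 3996. As a simplified sketch of the idea behind SMOTE (not the imbalanced-learn implementation), each synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. The function name, the value of `k`, and the toy points are assumptions for illustration.

```python
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and a near neighbour."""
    rng = rng or random.Random()
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to the base point.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    # Interpolate at a random position along the segment base -> neighbour.
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]

# Made-up, already-scaled minority points and a made-up majority count.
minority = [[0.2, 0.4], [0.3, 0.5], [0.25, 0.45], [0.8, 0.9]]
n_majority = 7

rng = random.Random(0)  # seeded only to make the sketch repeatable
synthetic = [smote_like_sample(minority, rng=rng)
             for _ in range(n_majority - len(minority))]
balanced_minority = minority + synthetic
print(len(balanced_minority))  # 7, matching the majority count
```

Only the minority class receives new rows; the original samples of both classes are kept unchanged.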

<h3>Modeling</h3>

In [25]:
from numpy import mean
from numpy import std
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

In [26]:
cv = KFold(n_splits=10, random_state=1, shuffle=True)

In [27]:
rf = RandomForestClassifier(n_estimators=1000)

In [28]:
scores = cross_val_score(rf, x_res, y_res, scoring='accuracy', cv=cv, n_jobs=-1)

In [29]:
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Accuracy: 0.824 (0.014)


In [30]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2, stratify = y)

In [32]:
rf = RandomForestClassifier(n_estimators=1000)
rf.fit(x_train, y_train)

RandomForestClassifier(n_estimators=1000)

In [34]:
from sklearn.metrics import accuracy_score
rf_predictions = rf.predict(x_test)
rf_accuracy = accuracy_score(y_test, rf_predictions)  # argument order is (y_true, y_pred)
print(rf_accuracy)

0.823170731707317


<h3>SageMaker Pipeline</h3>

In [37]:
preprocessing_script = '/home/ec2-user/SageMaker/Water-Quality-Project/water_quality_preprocessing.py'

In [42]:
import sagemaker
role = sagemaker.get_execution_role()
sagemaker_session = sagemaker.Session()

We create a preprocessor from the script. It takes raw data as input and returns the processed data. When placed in a pipeline, it emits the data as JSON (as defined by the output_fn function we wrote), and that output is passed automatically to the next container in the pipeline, which would be the inference container. JSON is used because it is the format that other containers can read by default.

To do this, we define an SKLearn estimator. We don't actually use it as an estimator, which is why we had to override the predict_fn and model_fn functions.

In [61]:
from sagemaker.sklearn.estimator import SKLearn

sklearn_preprocessor = SKLearn(
    entry_point = preprocessing_script, 
    role = role,
    framework_version="0.20.0",  # now required (0.23-1 is also supported but requires code change)
    py_version="py3",  # now required
    instance_type = 'ml.m4.xlarge',
    # instance_type = 'local',
    # sagemaker_session = sagemaker_session
)

In [None]:
sklearn_preprocessor.fit({'train': s3_input_uri})
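
The preprocessing script itself is not shown in this notebook. As a rough sketch of the hook functions this section refers to, the example below follows the hook names and argument order used by the SageMaker scikit-learn serving container (`model_fn`, `input_fn`, `predict_fn`, `output_fn`) but uses only the standard library. The hard-coded statistics, the CSV input format, and the `"instances"` key are assumptions for illustration. In a real container the output would typically be wrapped in the container's response type, which is omitted here.

```python
import json

def model_fn(model_dir):
    """Load the fitted preprocessing state; here, hypothetical hard-coded min/max values."""
    return {"min": [0.0, 100.0], "max": [14.0, 300.0]}

def input_fn(request_body, content_type):
    """Parse a CSV payload into rows of floats."""
    if content_type != "text/csv":
        raise ValueError(f"Unsupported content type: {content_type}")
    return [[float(v) for v in line.split(",")]
            for line in request_body.strip().splitlines()]

def predict_fn(rows, model):
    """Transform the rows with min-max scaling instead of predicting a label."""
    return [
        [(v - lo) / (hi - lo) for v, lo, hi in zip(row, model["min"], model["max"])]
        for row in rows
    ]

def output_fn(scaled_rows, accept):
    """Serialize the transformed rows as JSON for the next container."""
    if accept != "application/json":
        raise ValueError(f"Unsupported accept type: {accept}")
    return json.dumps({"instances": scaled_rows})

# Local dry run that chains the hooks in the order the container would call them.
model = model_fn("/opt/ml/model")
rows = input_fn("7.0,200.0\n3.5,150.0", "text/csv")
payload = output_fn(predict_fn(rows, model), "application/json")
print(payload)
```

Overriding `predict_fn` in this way is what lets an estimator class act as a transformer: the "prediction" it returns is simply the preprocessed feature matrix.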

<h3>Build your own container</h3>

In [67]:
from sagemaker.estimator import Estimator

byoc_est = Estimator(
    role=role,
    instance_count=1,
    instance_type='local',
    image_uri=container_uri
)

In [None]:
byoc_est.fit({'train': s3_input_uri})