# Water Potability Data Exploration Notebook

In [12]:
import pyspark
import pyspark.sql.functions as F
import pyspark.sql.types as T
import sys
from pyspark.sql import SparkSession, Window, DataFrame
from pyspark.mllib.stat import Statistics

sys.path.append('/home/jovyan/work')

In [13]:
spark = SparkSession.builder.getOrCreate()

## Step 1 - Exploration

In [14]:
df = spark.read.csv('../data/water_potability.csv', inferSchema=True, header=True)

In [15]:
print('Total record count: {}'.format(df.count()))

Total record count: 3276


### Make a Train/Test Split

In [16]:
df_train, df_test = df.randomSplit([0.7, 0.3], seed=42)

# Save the train and test datasets
# index=False keeps the pandas row index out of the CSV files
df_train.toPandas().to_csv('../data/water_potability_train.csv', index=False)

df_test.toPandas().to_csv('../data/water_potability_test.csv', index=False)

# Get rid of df so we don't accidentally use it
del df

In [17]:
df_train.printSchema()

root
 |-- ph: double (nullable = true)
 |-- Hardness: double (nullable = true)
 |-- Solids: double (nullable = true)
 |-- Chloramines: double (nullable = true)
 |-- Sulfate: double (nullable = true)
 |-- Conductivity: double (nullable = true)
 |-- Organic_carbon: double (nullable = true)
 |-- Trihalomethanes: double (nullable = true)
 |-- Turbidity: double (nullable = true)
 |-- Potability: integer (nullable = true)



In [18]:
print('Train record count: {}'.format(df_train.count()))

Train record count: 2353
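
`randomSplit` assigns rows to the splits at random, so the `[0.7, 0.3]` weights are targets rather than exact proportions. As a quick sanity check, the realized training share can be worked out from the two counts printed so far. The variable names below are illustrative only, and the test count is derived here rather than printed by the notebook.

```python
# Counts copied from the notebook output above.
total_count = 3276
train_count = 2353

# Realized share of rows that landed in the training split.
train_share = train_count / total_count

# Every row goes to exactly one split, so the remainder is the test set.
test_count = total_count - train_count

print('Train share: {:.1%} (requested 70%)'.format(train_share))
print('Implied test record count: {}'.format(test_count))
```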


### View Sample Rows

In [19]:
df_train.show(5)

+----+------------------+------------------+------------------+------------------+------------------+------------------+-----------------+------------------+----------+
|  ph|          Hardness|            Solids|       Chloramines|           Sulfate|      Conductivity|    Organic_carbon|  Trihalomethanes|         Turbidity|Potability|
+----+------------------+------------------+------------------+------------------+------------------+------------------+-----------------+------------------+----------+
|null|  98.3679148956603| 28415.57583214058|10.558949998467961|  296.843207792478|505.24026927891407|12.882614472289333|85.32995534051292| 4.119087300328971|         1|
|null|103.46475866009455| 27420.16742458204| 8.417305032089528|              null|485.97450045781375|11.351132730708514| 67.8699636759021| 4.620793451653219|         0|
|null|108.91662923953173|14476.335695268315| 5.398162017711099|  281.198274407849| 512.2323064106689|15.013793389990155| 86.6714587149138| 3.89557206226812

So, this is a binary classification problem with an integer target, `Potability`, that takes the values 0 and 1 (in this dataset, 1 marks a sample as potable and 0 as not potable). **Note:** Several `ph` and `Sulfate` values are already null in the first few rows, so missing data will need attention.

### Target Column Distribution

In [20]:
freq_table = df_train.groupBy("Potability").count().toPandas()
freq_table

|   | Potability | count |
|---|------------|-------|
| 0 | 1          | 905   |
| 1 | 0          | 1448  |


This is a binary classification problem with a moderately imbalanced training set: 1448 records (about 62%) have `Potability` 0 and 905 (about 38%) have `Potability` 1.
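
To put a number on the imbalance, here is a small sketch that uses the counts from the frequency table. The "balanced" weight formula `n_samples / (n_classes * class_count)` is one common heuristic, not something this notebook has committed to, and the variable names are assumptions made for illustration.

```python
# Class counts copied from the frequency table above.
class_counts = {0: 1448, 1: 905}
n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# Share of the training set held by each class.
class_share = {label: count / n_samples for label, count in class_counts.items()}

# "Balanced" weights: the rarer class gets a proportionally larger weight.
class_weight = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label in sorted(class_counts):
    print('Potability={}: share={:.1%}, weight={:.4f}'.format(
        label, class_share[label], class_weight[label]))
```

Weights like these could be passed to a model that supports per-class or per-row weighting, if the imbalance turns out to hurt recall on the minority class.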

### Summary Statistics

In [21]:
df_train.describe().toPandas().transpose()

| summary | count | mean | stddev | min | max |
|---|---|---|---|---|---|
| ph | 1987 | 7.065394544064872 | 1.612338730707101 | 0.0 | 13.541240236567981 |
| Hardness | 2353 | 196.85970996091234 | 32.83914120248886 | 47.432 | 323.124 |
| Solids | 2353 | 22131.281786788568 | 8719.781557904313 | 320.942611274359 | 61227.19600771213 |
| Chloramines | 2353 | 7.134664863542833 | 1.5939471833043641 | 0.3520000000000003 | 13.127000000000002 |
| Sulfate | 1812 | 333.8051408041043 | 41.550506072620976 | 129.00000000000003 | 481.03064230599716 |
| Conductivity | 2353 | 425.74124559272485 | 80.696216405619 | 181.483753985146 | 753.3426195583046 |
| Organic_carbon | 2353 | 14.237178459474823 | 3.272659356030207 | 2.1999999999999886 | 27.00670661116601 |
| Trihalomethanes | 2229 | 66.27958491438079 | 16.241327199504088 | 8.175876384274268 | 120.03007700530675 |
| Turbidity | 2353 | 3.969157865709428 | 0.7823760214428234 | 1.45 | 6.739 |


### Null Counts

In [22]:
for col in df_train.columns:
    print('{} Null Count: {}'.format(col, df_train.where(F.col(col).isNull()).count()))

ph Null Count: 366
Hardness Null Count: 0
Solids Null Count: 0
Chloramines Null Count: 0
Sulfate Null Count: 541
Conductivity Null Count: 0
Organic_carbon Null Count: 0
Trihalomethanes Null Count: 124
Turbidity Null Count: 0
Potability Null Count: 0


We have nulls to deal with in three columns: `ph` (366), `Sulfate` (541), and `Trihalomethanes` (124). The remaining columns are complete.
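
The null counts are easier to compare as fractions of the training set. The sketch below reuses the counts printed above, with the train record count of 2353 as the denominator. The names are illustrative and not part of the notebook.

```python
# Train record count and per-column null counts copied from the output above.
train_count = 2353
null_counts = {'ph': 366, 'Sulfate': 541, 'Trihalomethanes': 124}

# Fraction of training rows with a missing value in each column.
null_share = {name: count / train_count for name, count in null_counts.items()}

# Print the columns from most to least affected.
for name, share in sorted(null_share.items(), key=lambda item: item[1], reverse=True):
    print('{}: {:.1%} null'.format(name, share))
```

`Sulfate` is missing in roughly 23% of the training rows, so simply dropping incomplete rows would discard a sizeable part of the data. That is one argument for considering imputation instead.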