Hello, everyone!

In this session, we will analyze WATER QUALITY data.

* Run the Python code in GOOGLE COLAB.
=> http://colab.research.google.com

* The practice files are available on GitHub or the course board.
=> github: http://github.com/dscoool/waterai/
=> eCampus:

Let's type in the code and work through it together!

A note on practicing: be sure to type every line of code by hand, and **do not use CTRL+C / CTRL+V!** Copying and pasting code will not improve your skills. Copying and pasting things like URLs is fine.





In [None]:
import numpy as np  # NumPy: numerical computing library
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns  # data visualization
import matplotlib.pyplot as plt  # data visualization

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Water Quality
Access to safe drinking-water is essential to health, a basic human right and a component of effective policy for health protection. This is important as a health and development issue at a national, regional and local level. In some regions, it has been shown that investments in water supply and sanitation can yield a net economic benefit, since the reductions in adverse health effects and health care costs outweigh the costs of undertaking the interventions.


1. [Load and Check Data](#1)

1. [Variable Description](#2)
    * [Univariate Variable Analysis](#3)
        * [Categorical Variable](#4)
        * [Numerical Variable](#5)

1. [Missing Value](#6)
    * [Find Missing Value](#7)
    * [Drop Missing Values](#8)

1. [Visualization](#9)    
    
1. [Modeling](#10)

1. [Summary](#11)


<a id = "1"></a><br>
# Load and Check Data

In [None]:
data = pd.read_csv("/kaggle/input/water-potability/water_potability.csv")
data.head()

In [None]:
data.describe()

<a id = "2"></a><br>
# Variable Description

* pH: Acidity or alkalinity of the water (column `ph`).
* Hardness: Hardness of the water.
* Solids: Solids dissolved in the water.
* Chloramines: Chloramines dissolved in the water.
* Sulfate: Sulfate contained in the water.
* Conductivity: Electrical conductivity of the water.
* Organic Carbon: Organic carbon dissolved in the water (column `Organic_carbon`).
* Trihalomethanes: Chemicals that may be found in water.
* Turbidity: Cloudiness of the water, i.e. how strongly suspended particles scatter light.
* Potability: Whether the water is safe to drink. 1 is potable, 0 is not potable.

In [None]:
data.info()

<a id = "3"></a><br>
# Univariate Variable Analysis
* Categorical Variable: Potability
* Numerical Variable: pH, Hardness, Solids, Chloramines, Sulfate, Conductivity, Organic Carbon, Trihalomethanes, Turbidity

<a id = "4"></a><br>
## Categorical Variable

In [None]:
plt.figure(figsize = (4,4))
sns.countplot(data=data, x="Potability")
plt.show()

* The plot clearly shows that the two Potability classes are not balanced.
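
To put numbers on the imbalance, the class counts and shares can be printed directly. This is a minimal sketch on a made-up toy DataFrame (the labels below are not from the real dataset); in the notebook you would call the same methods on `data["Potability"]`.

```python
import pandas as pd

# Toy stand-in for the notebook's `data`; these labels are made up for illustration.
toy = pd.DataFrame({"Potability": [0, 0, 0, 1, 1, 0, 1, 0, 0, 1]})

counts = toy["Potability"].value_counts()                # number of rows per class
shares = toy["Potability"].value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares.round(2))
```

If one class has a much larger share, accuracy alone can be misleading, so it is worth keeping these numbers in mind when reading the model scores later.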

<a id = "5"></a><br>
## Numerical Variable
Our numerical variables are:
* pH
* Hardness
* Solids
* Chloramines
* Sulfate
* Conductivity
* Organic Carbon
* Trihalomethanes
* Turbidity

Let's plot a histogram of each variable, colored by Potability, to see its distribution and how it differs between potable and non-potable samples.

In [None]:
def histplot(var):
    """Plot a histogram of column `var`, split by Potability."""
    plt.figure(figsize=(6, 3))
    sns.histplot(data=data, x=var, hue="Potability")
    plt.xlabel(var)
    plt.ylabel("count")
    plt.show()

In [None]:
numvars = ["ph","Hardness","Solids","Chloramines","Sulfate","Conductivity","Organic_carbon","Trihalomethanes","Turbidity"]
for n in numvars:
    histplot(n)
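
The histograms show the shape of each distribution; a per-class numeric summary can complement them. Here is a small sketch on made-up values (not the real measurements) that compares the mean of each column between the two Potability classes. In the notebook, the same `groupby` call could be applied to `data` with the `numvars` list.

```python
import pandas as pd

# Made-up values for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "ph": [6.5, 7.0, 7.5, 8.0],
    "Hardness": [180.0, 200.0, 190.0, 210.0],
    "Potability": [0, 0, 1, 1],
})

# Mean of each numerical column within each Potability class.
class_means = toy.groupby("Potability")[["ph", "Hardness"]].mean()
print(class_means)
```

Columns whose class means are nearly identical are unlikely to separate the classes on their own, although they may still help in combination with other features.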

<a id = "6"></a><br>
# Missing Value


<a id = "7"></a><br>
## Find Missing Value

In [None]:
data.isnull().sum()

<a id = "8"></a><br>
## Dropping Missing Values

In [None]:
data = data.dropna()
data.isnull().sum()
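
Note that `dropna()` removes every row that has at least one missing value, which can discard a lot of data. The sketch below, on a made-up toy DataFrame, shows how to check how much is lost and, as one possible alternative (not what this notebook does), how to fill the gaps with each column's median instead.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; NaN marks a missing measurement.
toy = pd.DataFrame({
    "ph": [7.0, np.nan, 6.5, 8.0],
    "Sulfate": [330.0, 350.0, np.nan, 310.0],
    "Potability": [0, 1, 0, 1],
})

missing_share = toy.isnull().mean()  # fraction of missing values per column
dropped = toy.dropna()               # removes any row containing a NaN
filled = toy.fillna(toy.median())    # alternative: fill with each column's median

print(missing_share)
print(f"rows before: {len(toy)}, after dropna: {len(dropped)}, after fillna: {len(filled)}")
```

Dropping is simpler and avoids inventing values; filling keeps more rows. Which one is better depends on how many rows are affected.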

<a id = "9"></a><br>
# Visualization

In [None]:
numvars = ["ph","Hardness","Solids","Chloramines","Sulfate","Conductivity","Organic_carbon","Trihalomethanes","Turbidity"]
fig, axes = plt.subplots(3, 3, figsize=(9, 9))
fig.suptitle('Distribution of Features')

# Draw one boxplot per feature and enable the grid on every subplot
# (a bare plt.grid() would only affect the last axes).
for ax, var in zip(axes.flat, numvars):
    sns.boxplot(ax=ax, data=data, x=var)
    ax.grid(True)

plt.tight_layout()
plt.show()

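* The boxplots give a first look at each feature's spread and outliers. The features look roughly normally distributed, although a boxplot alone cannot confirm normality. Let's move on to machine learning.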
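
One quick numeric check on the shape of a distribution is the sample skewness. The sketch below uses two synthetic columns (generated here for illustration, not the water quality data) to show what the values look like for a symmetric and a right-skewed feature. In the notebook, `data[numvars].skew()` would give the same statistic for the real features.

```python
import numpy as np
import pandas as pd

# Synthetic columns for illustration only; not the water quality data.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=7.0, scale=1.5, size=2000),
    "skewed": rng.lognormal(mean=0.0, sigma=1.0, size=2000),
})

# Sample skewness: close to 0 for symmetric data, positive for a long right tail.
skewness = toy.skew()
print(skewness.round(2))
```

A skewness near zero is consistent with a normal distribution but does not prove it, so treat this as a sanity check rather than a formal test.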

<a id = "10"></a><br>
# Modeling

In [None]:
%pip install pycaret

In [None]:
from pycaret.classification import *
clf = setup(data, target = "Potability", session_id = 786)
compare_models()

In [None]:
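model = create_model("et")  # "et" is PyCaret's ID for the Extra Trees Classifier
# Note: `data` is the full dataset, so these predictions include rows used for training.
predict = predict_model(model, data=data)
predict.head()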

<a id = "11"></a><br>
# Summary
The Extra Trees Classifier gave the best accuracy among the commonly used classification models compared, so we went on with it. It reached about 90% accuracy. If this figure comes from the `predict_model` call on the full dataset, it includes rows the model was trained on and is likely optimistic. This notebook uses the PyCaret library, which automatically fits a range of classification models to the specified data and target and reports the metrics for each model.
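
To see why it matters which rows an accuracy is measured on, here is a minimal sketch that uses scikit-learn directly instead of PyCaret, on synthetic data (assumed here for illustration; it is not the water potability dataset). It compares the accuracy on the training rows, on held-out rows, and on both together.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the water potability dataset):
# 9 random features and a label that is only weakly related to the first one.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 9))
y = (X[:, 0] + rng.normal(scale=2.0, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # rows the model has already seen
test_acc = model.score(X_test, y_test)     # rows held out from training
all_acc = model.score(X, y)                # both together, as when scoring a full dataset

print(f"train accuracy:     {train_acc:.3f}")
print(f"hold-out accuracy:  {test_acc:.3f}")
print(f"full-data accuracy: {all_acc:.3f}")
```

The full-data score sits between the other two because it mixes seen and unseen rows, so the hold-out score is the more honest estimate. In PyCaret, calling `predict_model(model)` without the `data` argument should score the hold-out split that `setup` created.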