# DATA 512 - Final Project Data Analysis
**Corey Christopherson**

This notebook records the data analysis for my study of the health of the Bainbridge Island aquifer system, based on well log data.

In [1]:
# Import packages
import numpy as np
import pandas as pd
import time
from sklearn.cluster import KMeans

In [2]:
# Define paths
dataPath = r'C:/Users/chrico7/Documents/__Corey Christopherson/MS Data Science/Courses/HCDE 512/Final Project/Data/'

In [3]:
%%html
<style>
  table {margin-left: 0 !important;}
</style>

The data cleaning process is summarized in the `hcds-final-project-data-exploration.ipynb` Jupyter Notebook. That process produces the file `well_data_bainbridge_island_final.csv`, which contains the following fields:

| Name | Description |
|:-------------------|:-----------------------------------------------------------------------------|
| Well Log Id        | Unique well ID |
| Well Comp Dt       | Well completion date |
| Year               | Well completion year |
| Well Depth Qt      | Depth of well in feet |
| Well Diameter Qt   | Diameter of well in feet |
| Well Depth         | Elevation of the well bottom in feet relative to mean sea level (MSL), i.e. elevation (ft) minus Well Depth Qt |
| elevation          | Elevation of well location from the Open-Elevation API in meters |
| elevation (ft)     | Elevation of well location from the Open-Elevation API in feet |
| St Plane Xcoord Nr | Well horizontal geographic coordinate value, WA State Plane Coordinate System |
| St Plane Ycoord Nr | Well vertical geographic coordinate value, WA State Plane Coordinate System |
| latitude           | Well latitude from the NGS Coordinate Conversion and Transformation Tool (NCAT) API |
| longitude          | Well longitude from the NGS Coordinate Conversion and Transformation Tool (NCAT) API |
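As a quick sanity check on the field definitions, the sketch below recomputes `Well Depth` and `elevation (ft)` for the two rows printed by `data_raw.head(2)` later in this notebook. The relationships tested (surface elevation minus drilled depth, and a factor of 3.28084 feet per meter) are assumptions inferred from those two rows, not documented rules of the cleaning notebook.

```python
import numpy as np
import pandas as pd

# Two rows copied from the data_raw.head(2) output in this notebook
sample = pd.DataFrame({
    'Well Depth Qt': [274.0, 4.0],
    'elevation': [76, 4],
    'elevation (ft)': [249.34384, 13.12336],
    'Well Depth': [-24.65616, 9.12336],
})

# Assumed relationship: well-bottom elevation = surface elevation - drilled depth
derived_depth = sample['elevation (ft)'] - sample['Well Depth Qt']
depth_matches = np.allclose(derived_depth, sample['Well Depth'])

# Assumed conversion: 1 meter = 3.28084 feet
derived_feet = sample['elevation'] * 3.28084
feet_matches = np.allclose(derived_feet, sample['elevation (ft)'])

print(depth_matches, feet_matches)
```

Running the same comparison on the full `data_raw` frame would show whether the relationship holds for every well or only for these two.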

In [4]:
# Read in data
data_raw = pd.read_csv('{}well_data_bainbridge_island_final.csv'.format(dataPath),
                       parse_dates=['Year','Well Comp Dt'], infer_datetime_format=True)

In [5]:
data_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1353 entries, 0 to 1352
Data columns (total 12 columns):
Well Log Id           1353 non-null int64
Well Comp Dt          1353 non-null datetime64[ns]
Year                  1353 non-null datetime64[ns]
Well Depth Qt         1353 non-null float64
Well Diameter Qt      1343 non-null float64
Well Depth            1353 non-null float64
elevation             1353 non-null int64
elevation (ft)        1353 non-null float64
St Plane Xcoord Nr    1353 non-null int64
St Plane Ycoord Nr    1353 non-null int64
latitude              1353 non-null float64
longitude             1353 non-null float64
dtypes: datetime64[ns](2), float64(6), int64(4)
memory usage: 127.0 KB


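The next step classifies wells into different aquifers based on depth, using k-means clustering.

A `Decade` label is added first so that wells can also be grouped by the decade in which they were completed.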

In [6]:
# Define decade groups (two-digit labels: 70 = before 1980, 0 = 2000s, 10 = 2010 onward)
year = data_raw['Year'].dt.year
data_raw.loc[:,'Decade'] = np.select([year < 1980, year < 1990, year < 2000, year < 2010],
                                     [70, 80, 90, 0], default=10)

In [7]:
data_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1353 entries, 0 to 1352
Data columns (total 13 columns):
Well Log Id           1353 non-null int64
Well Comp Dt          1353 non-null datetime64[ns]
Year                  1353 non-null datetime64[ns]
Well Depth Qt         1353 non-null float64
Well Diameter Qt      1343 non-null float64
Well Depth            1353 non-null float64
elevation             1353 non-null int64
elevation (ft)        1353 non-null float64
St Plane Xcoord Nr    1353 non-null int64
St Plane Ycoord Nr    1353 non-null int64
latitude              1353 non-null float64
longitude             1353 non-null float64
Decade                1353 non-null int32
dtypes: datetime64[ns](2), float64(6), int32(1), int64(4)
memory usage: 132.3 KB


In [9]:
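# Break out clustering inputs: depth only (X1) and year, location, and depth (X2)
X1 = data_raw['Well Depth'].copy()
X2 = data_raw[['Year','latitude','longitude', 'Well Depth']].copy()
# Replace the column outright so it becomes an integer year rather than a datetime
X2['Year'] = X2['Year'].dt.year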

In [10]:
# Define KMeans clustering machine
km = KMeans(n_clusters=5, init='random',
            n_init=10, max_iter=500, tol=1e-04)

In [11]:
# Fit model - depth only
y1_km = km.fit_predict(X1.values.reshape(-1,1))

In [12]:
# Fit model - all
y2_km = km.fit_predict(X2)
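K-means assigns points by Euclidean distance, so a feature with a large numeric range (here depth in feet) can outweigh features with small ranges (latitude and longitude in degrees). The sketch below shows one possible variation, not the method used in this notebook: it standardizes the features first and sets `random_state` so the labels repeat between runs. The `demo` frame is synthetic, with made-up values standing in for `X2`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X2 (values are made up for illustration only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Year': rng.integers(1970, 2016, size=200),
    'latitude': rng.uniform(47.57, 47.72, size=200),
    'longitude': rng.uniform(-122.60, -122.49, size=200),
    'Well Depth': rng.uniform(-900, 300, size=200),
})

# Put every feature on a comparable scale (zero mean, unit variance)
scaled = StandardScaler().fit_transform(demo)

# random_state fixes the random initialization so labels repeat between runs
km_demo = KMeans(n_clusters=5, init='random', n_init=10,
                 max_iter=500, tol=1e-04, random_state=0)
labels = km_demo.fit_predict(scaled)
```

Whether scaling is appropriate depends on how much weight location and year should carry relative to depth, which is a modeling decision rather than a correction.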

In [13]:
# Add in new class labels to data
data_raw.loc[:,'Aquifer'] = y1_km
data_raw.loc[:,'Aquifer - all'] = y2_km
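The integer labels returned by k-means are arbitrary, so a label value says nothing about which cluster is deeper. A minimal sketch of how the clusters could be inspected is shown below; the `demo` frame holds made-up depths and labels standing in for `data_raw['Well Depth']` and `data_raw['Aquifer']`.

```python
import pandas as pd

# Made-up depths and labels standing in for the real columns
demo = pd.DataFrame({
    'Well Depth': [-310.0, -295.5, -120.0, -98.2, 15.0, 40.3],
    'Aquifer': [2, 2, 0, 0, 1, 1],
})

# Count, minimum, mean, and maximum depth per cluster,
# ordered from the smallest mean Well Depth upward
summary = (demo.groupby('Aquifer')['Well Depth']
               .agg(['count', 'min', 'mean', 'max'])
               .sort_values('mean'))
print(summary)
```

Applied to `data_raw`, the same aggregation would show the depth range each cluster covers and whether neighbouring clusters overlap.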

In [14]:
## Iterate through decades and classify wells
#data_raw2 = pd.DataFrame()
#for i in data_raw['Decade'].drop_duplicates():
#    temp_df = data_raw[data_raw['Decade']==i].copy()
#    temp_km = km.fit_predict(temp_df['Well Depth'].values.reshape(-1,1))
#    temp_df['Aquifer - Decade2'] = temp_km
#    data_raw2 = data_raw2.append(temp_df)

In [15]:
## Adjust classification integer mapping
#
## build map
#aquifer_map = pd.DataFrame()
#for i in data_raw2['Decade'].drop_duplicates():
#    temp_df = (data_raw2[data_raw2['Decade']==i].groupby(['Decade','Aquifer - Decade2'])['Well Depth'].mean()
#               .reset_index().sort_values(['Decade','Well Depth']).reset_index(drop=True).reset_index()
#               .rename({'index':'New Class'},axis=1)[['Decade','Aquifer - Decade2','New Class']])
#    aquifer_map = aquifer_map.append(temp_df)
#
## join map to data
#data_raw2 = pd.merge(data_raw2, aquifer_map, how='left', on=['Decade','Aquifer - Decade2'])
#
## update values
#data_raw2.loc[:,'Aquifer - Decade2'] = data_raw2['New Class']
#data_raw2.drop('New Class', axis=1, inplace=True)
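The two commented-out cells above outline a per-decade classification followed by a relabeling step that orders classes by mean depth. A compact sketch of that idea is given below, under stated assumptions: the `demo` frame is synthetic, `random_state` is added for repeatability, and `pd.concat` stands in for `DataFrame.append`, which pandas 2.0 removed. It is an illustration of the approach, not the code that produced this notebook's results.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for data_raw (values are made up for illustration only)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    'Decade': rng.choice([70, 80, 90, 0, 10], size=300),
    'Well Depth': rng.uniform(-900, 300, size=300),
})

parts = []
for decade, group in demo.groupby('Decade'):
    group = group.copy()
    # Cluster this decade's wells on depth alone
    km_demo = KMeans(n_clusters=5, n_init=10, random_state=0)
    raw = km_demo.fit_predict(group[['Well Depth']])
    # Rank clusters by mean depth so class 0 always has the smallest mean Well Depth
    order = group.groupby(raw)['Well Depth'].mean().sort_values().index
    remap = {old: new for new, old in enumerate(order)}
    group['Aquifer - Decade2'] = pd.Series(raw, index=group.index).map(remap)
    parts.append(group)

# Reassemble the per-decade pieces in the original row order
demo2 = pd.concat(parts).sort_index()
```

Because the relabeling is applied within each decade, class numbers become comparable across decades in rank only; the depth ranges behind a given class can still differ from one decade to the next.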

In [17]:
data_raw.head(2)

|   | Well Log Id | Well Comp Dt | Year | Well Depth Qt | Well Diameter Qt | Well Depth | elevation | elevation (ft) | St Plane Xcoord Nr | St Plane Ycoord Nr | latitude | longitude | Decade | Aquifer | Aquifer - all |
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
| 0 | 60938 | 1979-02-14 | 1979-01-01 | 274.0 | 6.0 | -24.65616 | 76 | 249.34384 | 1135594 | 873308 | 47.70995 | -122.550446 | 70 | 1 | 3 |
| 1 | 60983 | 1992-10-02 | 1992-01-01 | 4.0 | 40.0 | 9.12336 | 4 | 13.12336 | 1132849 | 833770 | 47.60141 | -122.557402 | 90 | 1 | 3 |


In [18]:
# Output data
data_raw.to_csv('{}well_data_aquifer_classifications.csv'.format(dataPath),header=True,index=False)

Linear regression