# PLAN

- [x] Acquisition
    - [x] read the csv into a dataframe
- [ ] Preparation
    - [ ] no missing values
    - [ ] drop columns that are not needed
    - [x] change case to lower case
    - [ ] make sure everything has right dtype
    - [ ] normalize what needs to be normalized
    - [x] rename columns for clarification
- [ ] Exploration
    - [ ] answer ALL questions raised
        - [x] Which locations are the most frequent sites of SSO?
        - [x] Which locations have the largest volume of overflow?
        - [x] What are the most common root causes of SSO?
        - [x] Where does the majority of overflow go?

    - [ ] visualize important findings
    - [ ] decide what TODO items to keep
- [ ] Modeling
    - [ ] predict 
- [ ] Delivery
    - [ ] report
    - [ ] prezi slides
    - [ ] website

# ENVIRONMENT

In [1]:
import os
import acquire
import prepare
import pandas as pd
import numpy as np

# data visualization 
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import statsmodels.api as sm

from datetime import timedelta, datetime
from pylab import rcParams

# widen pandas display limits so wide DataFrames are not truncated
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# ACQUIRE

### _Let's read in the data from the csv file and take a peek at the first five records._

In [2]:
df = acquire.read_data('saws-sso.csv')

In [3]:
df.head()

Unnamed: 0,SSO_ID,INSPKEY,SERVNO,REPORTDATE,SPILL_ADDRESS,SPILL_ST_NAME,TOTAL_GAL,GALSRET,GAL,SPILL_START,SPILL_STOP,HRS,CAUSE,COMMENTS,ACTIONS,WATERSHED,UNITID,UNITID2,DISCHARGE_TO,DISCHARGE_ROUTE,COUNCIL_DISTRICT,FERGUSON,Month,Year,Week,EARZ_ZONE,Expr1029,PIPEDIAM,PIPELEN,PIPETYPE,INSTYEAR,DWNDPTH,UPSDPTH,Inches_No,RainFall_Less3,SPILL ADDRESS,SewerAssetExp,NUM_SPILLS_COMPKEY,NUM_SPILLS_24MOS,PREVSPILL_24MOS,UNITTYPE,ASSETTYPE,LASTCLND,ResponseTime,ResponseDTTM,Public Notice,TIMEINT,Root_Cause,STEPS_TO_PREVENT,SPILL_START_2,SPILL_STOP_2,HRS_2,GAL_2,SPILL_START_3,SPILL_STOP_3,HRS_3,GAL_3
0,6582,567722.0,,3/10/19,3200,THOUSAND OAKS DR,2100,2100.0,2100.0,3/10/2019 1:16:00 PM,3/10/2019 2:40:00 PM,1.4,Grease,Spill ContainedReturned to SystemArea Cleaned ...,CLEANED MAIN,SALADO CREEK,66918,66917,STREET,,,172A2,3,2019,11,0.0,,8.0,16.55,PVC,1997.0,,,,,3200 THOUSAND OAKS DR,,1,1.0,,GRAVITY,Sewer Main,,0.45,10-Mar-19,False,24.0,,,,,0.0,0.0,,,0.0,0.0
1,6583,567723.0,,3/10/19,6804,S FLORES ST,80,0.0,80.0,3/10/2019 2:25:00 PM,3/10/2019 3:45:00 PM,1.333333,Grease,Spill ContainedArea Cleaned and Disinfected,CLEANED MAIN,DOS RIOS,24250,24193,STORMDRAIN,,3.0,251A3,3,2019,11,0.0,,8.0,157.0,PVC,1988.0,,,,,6804 S FLORES,,1,1.0,,GRAVITY,Sewer Main,,1.08,10-Mar-19,False,120.0,,,,,0.0,0.0,,,0.0,0.0
2,6581,567714.0,,3/9/19,215,AUDREY ALENE DR,79,0.0,10.0,3/9/2019 6:00:00 PM,3/9/2019 7:30:00 PM,1.5,Structural,Spill ContainedArea Cleaned and DisinfectedFlu...,CLEANED MAIN,DOS RIOS,2822,3351,ALLEY,,1.0,190E4,3,2019,10,0.0,,8.0,350.0,CP,1955.0,,,,,215 Audrey Alene Dr,,1,1.0,,GRAVITY,Sewer Main,,1.0,09-Mar-19,False,24.0,,,03/10/2019 09:36,03/10/2019 10:45,1.15,69.0,,,0.0,0.0
3,6584,567713.0,,3/9/19,3602,SE MILITARY DR,83,0.0,83.0,3/9/2019 3:37:00 PM,3/9/2019 5:00:00 PM,1.383333,Grease,Spill ContainedArea Cleaned and DisinfectedFlu...,,SALADO CREEK,92804,92805,EASEMENT,,3.0,252C3,3,2019,10,0.0,,8.0,213.91,PVC,1983.0,,,,,3602 SE MILITARY DR,,1,1.0,,GRAVITY,Sewer Main,,0.55,09-Mar-19,False,120.0,,,,,0.0,0.0,,,0.0,0.0
4,6580,567432.0,,3/6/19,100,PANSY LN,75,0.0,75.0,3/6/2019 9:40:00 AM,3/6/2019 9:55:00 AM,0.25,Structural,Spill ContainedArea Cleaned and DisinfectedFlu...,CLEANED MAIN,SALADO CREEK,61141,49543,STREET,,2.0,192A7,3,2019,10,0.0,,12.0,291.9,CP,1952.0,,,,,100 PANSY LN,,2,2.0,15-Dec-18,GRAVITY,Sewer Main,,0.0,06-Mar-19,False,3.0,,,,,0.0,0.0,,,0.0,0.0


# PREPARE

### _Let's convert the column names to lowercase to make them easier to work with, and rename them for clarity._

In [4]:
df = prepare.lowercase_and_rename(df)

In [5]:
df.head().T

Unnamed: 0,0,1,2,3,4
sso_id,6582,6583,6581,6584,6580
inspection_key,567722,567723,567714,567713,567432
service_number,,,,,
report_date,3/10/19,3/10/19,3/9/19,3/9/19,3/6/19
spill_address,3200,6804,215,3602,100
spill_street_name,THOUSAND OAKS DR,S FLORES ST,AUDREY ALENE DR,SE MILITARY DR,PANSY LN
total_gallons,2100,80,79,83,75
gallons_returned,2100,0,0,0,0
gallons_1,2100,80,10,83,75
spill_start_1,3/10/2019 1:16:00 PM,3/10/2019 2:25:00 PM,3/9/2019 6:00:00 PM,3/9/2019 3:37:00 PM,3/6/2019 9:40:00 AM
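
`prepare.lowercase_and_rename` lives in a local module that is not shown in this notebook. A minimal sketch of what such a helper might look like, assuming it lowercases every column name and then applies a rename mapping. The function name `lowercase_and_rename_sketch` and the `RENAMES` dictionary are hypothetical; the mapping contains only pairs that can be inferred from the outputs above.

```python
import pandas as pd

# Hypothetical mapping: only pairs that can be inferred from the outputs shown above.
RENAMES = {
    "inspkey": "inspection_key",
    "servno": "service_number",
    "reportdate": "report_date",
    "spill_st_name": "spill_street_name",
    "total_gal": "total_gallons",
    "galsret": "gallons_returned",
    "gal": "gallons_1",
    "spill_start": "spill_start_1",
}

def lowercase_and_rename_sketch(df):
    """Lowercase every column name, then apply the readable renames."""
    out = df.copy()
    out.columns = [str(col).strip().lower() for col in out.columns]
    # rename() ignores mapping keys that are not present in the frame
    return out.rename(columns=RENAMES)

# Toy frame with a few of the raw column names
raw = pd.DataFrame({"SSO_ID": [6582], "INSPKEY": [567722.0],
                    "TOTAL_GAL": [2100], "GAL": [2100.0]})
renamed = lowercase_and_rename_sketch(raw)
print(list(renamed.columns))
```

Working on a copy keeps the raw frame available for comparison, which matches the notebook's habit of copying before modifying.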


## _Let's make copies of the original dataframe before dropping some columns and rows to cover scenarios where we uncover more information about the variables._

In [6]:
df1 = df.copy()
df2 = df.copy()
df3 = df.copy()

### _Let's work with df1..._

### _Let's remove columns that do not add information._

### _Let's create a new address that's prettier._

In [7]:
df1 = prepare.ready_df1(df1)
df1

Unnamed: 0,report_date,spill_address,spill_street_name,total_gallons,gallons_returned,gallons_1,spill_start_1,spill_stop_1,hours_1,cause,actions,watershed,unit_id_1,unit_id_2,discharge_to,discharge_route,council_district,month,year,week,edwards_zone,pipe_diameter,pipe_length,pipe_type,installation_year,inches_no,rainfall_less_3,spill_address_full,num_spills_compkey,num_spills_24mos,unit_type,asset_type,last_cleaned,response_time,response_datetime,public_notice,time_int,root_cause,steps_to_prevent,spill_start_2,spill_stop_2,hours_2,gallons_2,spill_start_3,spill_stop_3,hours_3,gallons_3,spill_street_address
0,3/10/19,3200,THOUSAND OAKS DR,2100,2100.0,2100.0,3/10/2019 1:16:00 PM,3/10/2019 2:40:00 PM,1.400000,Grease,CLEANED MAIN,SALADO CREEK,66918,66917,STREET,,,3,2019,11,0.0,8.0,16.550000,PVC,1997.0,,,3200 THOUSAND OAKS DR,1,1.0,GRAVITY,Sewer Main,,0.45,10-Mar-19,False,24.0,,,,,0.00,0.0,,,0.00,0.0,3200 THOUSAND OAKS DR
1,3/10/19,6804,S FLORES ST,80,0.0,80.0,3/10/2019 2:25:00 PM,3/10/2019 3:45:00 PM,1.333333,Grease,CLEANED MAIN,DOS RIOS,24250,24193,STORMDRAIN,,3.0,3,2019,11,0.0,8.0,157.000000,PVC,1988.0,,,6804 S FLORES,1,1.0,GRAVITY,Sewer Main,,1.08,10-Mar-19,False,120.0,,,,,0.00,0.0,,,0.00,0.0,6804 S FLORES ST
2,3/9/19,215,AUDREY ALENE DR,79,0.0,10.0,3/9/2019 6:00:00 PM,3/9/2019 7:30:00 PM,1.500000,Structural,CLEANED MAIN,DOS RIOS,2822,3351,ALLEY,,1.0,3,2019,10,0.0,8.0,350.000000,CP,1955.0,,,215 Audrey Alene Dr,1,1.0,GRAVITY,Sewer Main,,1.00,09-Mar-19,False,24.0,,,03/10/2019 09:36,03/10/2019 10:45,1.15,69.0,,,0.00,0.0,215 AUDREY ALENE DR
3,3/9/19,3602,SE MILITARY DR,83,0.0,83.0,3/9/2019 3:37:00 PM,3/9/2019 5:00:00 PM,1.383333,Grease,,SALADO CREEK,92804,92805,EASEMENT,,3.0,3,2019,10,0.0,8.0,213.910000,PVC,1983.0,,,3602 SE MILITARY DR,1,1.0,GRAVITY,Sewer Main,,0.55,09-Mar-19,False,120.0,,,,,0.00,0.0,,,0.00,0.0,3602 SE MILITARY DR
4,3/6/19,100,PANSY LN,75,0.0,75.0,3/6/2019 9:40:00 AM,3/6/2019 9:55:00 AM,0.250000,Structural,CLEANED MAIN,SALADO CREEK,61141,49543,STREET,,2.0,3,2019,10,0.0,12.0,291.900000,CP,1952.0,,,100 PANSY LN,2,2.0,GRAVITY,Sewer Main,,0.00,06-Mar-19,False,3.0,,,,,0.00,0.0,,,0.00,0.0,100 PANSY LN
5,3/5/19,3200,S HACKBERRY ST,250,0.0,250.0,3/5/2019 2:22:00 PM,3/5/2019 2:32:00 PM,0.166667,Grease,CLEANED MAIN,DOS RIOS,38907,26117,STREET,,3.0,3,2019,10,0.0,8.0,315.000000,RL,1992.0,,,3200 S Hackberry St,2,2.0,GRAVITY,Sewer Main,,0.00,05-Mar-19,False,12.0,,,,,0.00,0.0,,,0.00,0.0,3200 S HACKBERRY ST
6,3/2/19,9910,SUGARLOAF DR,73,0.0,73.0,3/2/2019 1:42:00 PM,3/2/2019 2:55:00 PM,1.216667,Grease,CLEANED MAIN,MEDIO CREEK,85120,85363,DRAINAGE CULVERT,,4.0,3,2019,9,0.0,8.0,264.470000,PVC,1985.0,,,9910 Sugarloaf Dr,1,1.0,GRAVITY,Sewer Main,,0.73,02-Mar-19,False,120.0,GREASE,"Increase FCS,",,,0.00,0.0,,,0.00,0.0,9910 SUGARLOAF DR
7,3/1/19,3507,PIEDMONT AVE,76,0.0,76.0,3/1/2019 6:34:00 PM,3/1/2019 7:50:00 PM,1.266667,Grease,CLEANED MAIN,DOS RIOS,26128,24334,STORMDRAIN,,3.0,3,2019,9,0.0,8.0,60.000000,RL,2015.0,,,3507 Piedmont Ave,1,1.0,GRAVITY,Sewer Main,,0.43,01-Mar-19,False,120.0,,"Increase FCS,",,,0.00,0.0,,,0.00,0.0,3507 PIEDMONT AVE
8,2/26/19,349,ALICIA,3750,0.0,3750.0,2/26/2019 9:00:00 AM,2/26/2019 10:15:00 AM,1.250000,Structural,CLEANED MAIN,LEON CREEK,47292,47293,STORMDRAIN,,7.0,2,2019,9,0.0,8.0,175.390000,CP,1956.0,,,349 Alicia,1,1.0,GRAVITY,Sewer Main,,0.00,26-Feb-19,False,120.0,STRUCTURAL,"Design Request,",,,0.00,0.0,,,0.00,0.0,349 ALICIA
9,2/26/19,1502,W MISTLETOE AVE,66,0.0,66.0,2/26/2019 5:24:00 PM,2/26/2019 6:30:00 PM,1.100000,Grease,CLEANED MAIN,DOS RIOS,14241,14896,STREET,,1.0,2,2019,9,0.0,8.0,194.100000,PVC,1992.0,,,1502 W Mistletoe Ave,1,1.0,GRAVITY,Sewer Main,,0.43,26-Feb-19,False,120.0,DEBRIS,"Increase FCS,",,,0.00,0.0,,,0.00,0.0,1502 W MISTLETOE AVE


In [None]:
df1['spill_street_address'] = df1['spill_address'].map(str)+ ' ' + df1['spill_street_name']
df1.head(4).T

### _And drop the columns that are no longer needed._

In [None]:
df1 = df1.drop(columns=['spill_address_full', 'spill_address', 'spill_street_name'])
df1.head(4).T

### _Let's look at the missing values._

In [None]:
prepare.missing_values_col(df1)
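
The body of `prepare.missing_values_col` is not shown here. A rough sketch of one way to summarize missing values per column, assuming the helper reports a count and a percentage. The name `missing_values_col_sketch`, the output column names, and the toy data are all assumptions.

```python
import pandas as pd

def missing_values_col_sketch(df):
    """Return the number and percentage of missing values for each column."""
    num_missing = df.isnull().sum()
    pct_missing = (num_missing / len(df) * 100).round(1)
    summary = pd.DataFrame({"num_missing": num_missing,
                            "pct_missing": pct_missing})
    # Worst columns first, so candidates for dropping are easy to spot
    return summary.sort_values("num_missing", ascending=False)

# Toy data for illustration only
demo = pd.DataFrame({
    "cause": ["Grease", "Structural", None, "Grease"],
    "total_gallons": [2100, 80, 79, 83],
    "last_cleaned": [None, None, None, "15-Dec-18"],
})
missing = missing_values_col_sketch(demo)
print(missing)
```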

### _Let's make a new variable that flags whether an incident involved two or more spills within 24 hours._

In [None]:
# True when a second spill start time was recorded for the incident
df1['multiple_spills'] = df1['spill_start_2'].notnull()
df1

### _Now that we've made the multiple spills variable, we can remove redundant columns._

In [None]:
df1 = df1.drop(columns=['spill_start_2',
                        'spill_stop_2',
                        'hours_2',
                        'gallons_2',
                        'spill_start_3',
                        'spill_stop_3',
                        'hours_3',
                        'gallons_3',
                        'gallons_1'
                     ])
df1.head().T

### _Now that we've removed the redundant columns, we can give the remaining spill details simpler names._

In [None]:
# Rename only the columns; the row index is left unchanged
df1 = df1.rename(columns={"spill_start_1": "spill_start",
                          "spill_stop_1": "spill_stop",
                          "hours_1": "hours"})

### _Let's make the fields lowercase as well..._

In [None]:
df1 = lowercase_columm(df1, 'unit_type',
                 'asset_type',
                 'cause',
                 'actions',
                 'watershed',
                 'discharge_to',
                 'discharge_route',
                 'pipe_type',
                 'root_cause',
                 'spill_street_address'
                )
df1.head(4).T
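
`lowercase_columm` is called above without a module prefix and its definition is not shown in this notebook. A minimal sketch of a helper with the same call shape, assuming it lowercases the values of each named text column. The name `lowercase_columns_sketch` and the toy data are hypothetical.

```python
import pandas as pd

def lowercase_columns_sketch(df, *columns):
    """Lowercase the string values in each named column; missing values stay missing."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].str.lower()
    return out

# Toy data for illustration only
demo = pd.DataFrame({
    "cause": ["Grease", None],
    "watershed": ["DOS RIOS", "SALADO CREEK"],
    "total_gallons": [80, 83],
})
lowered = lowercase_columns_sketch(demo, "cause", "watershed")
print(lowered)
```

Lowercasing the values matters later in the notebook, where filters compare against `'grease'` rather than `'Grease'`.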

_Let's assign a variable with all numerical column names._

In [None]:
numerical_columns = list(df1.select_dtypes(include=[np.number]).columns.values)
numerical_columns

_Let's assign a variable with all non-numerical column names._

In [None]:
non_numerical_columns = list(df1.select_dtypes(exclude=[np.number]).columns.values)
non_numerical_columns

_Let's check the data types to see which columns need fixing._

In [None]:
df1.dtypes
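
The date columns are read in as text. A sketch of how they might be converted, assuming the formats seen in the first rows (`3/10/19` for `report_date`, `3/10/2019 1:16:00 PM` for the spill times) hold for the whole file. With `errors="coerce"`, any value that does not match becomes `NaT` instead of raising. The toy frame below copies two rows from the output above.

```python
import pandas as pd

demo = pd.DataFrame({
    "report_date": ["3/10/19", "3/9/19"],
    "spill_start": ["3/10/2019 1:16:00 PM", "3/9/2019 6:00:00 PM"],
    "spill_stop": ["3/10/2019 2:40:00 PM", "3/9/2019 7:30:00 PM"],
})

# Two-digit year for the report date
demo["report_date"] = pd.to_datetime(demo["report_date"],
                                     format="%m/%d/%y", errors="coerce")

# 12-hour clock with AM/PM for the spill timestamps
for col in ["spill_start", "spill_stop"]:
    demo[col] = pd.to_datetime(demo[col],
                               format="%m/%d/%Y %I:%M:%S %p", errors="coerce")

# Sanity check: 1:16 PM to 2:40 PM is 84 minutes, i.e. 1.4 hours
demo["hours_check"] = (demo["spill_stop"] - demo["spill_start"]).dt.total_seconds() / 3600
print(demo.dtypes)
```

Recomputing the duration from the parsed timestamps gives a quick way to cross-check the `hours` column that ships with the data.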

# EXPLORE

### Looking for the repeat offenders...

In [None]:
df1.num_spills_24mos[df1.num_spills_24mos > 1].value_counts()

### Locations of the most frequent SSOs in 2 years

In [None]:
df1[['spill_street_address']][df1.num_spills_24mos >= 9]

### Total number of gallons spilled by the most frequent SSOs in 2 years

In [None]:
df1.total_gallons[df1.num_spills_24mos >= 9].agg('sum')

In [None]:
df1[['spill_street_address', 'total_gallons', 'hours', 'root_cause',
     'unit_type', 'asset_type', 'last_cleaned', 'multiple_spills',
     'discharge_to', 'discharge_route']][df1.num_spills_24mos >= 9]

### Most common root causes of SSOs

In [None]:
df1.root_cause.value_counts()

- [ ] **TODO:** Find a way to flesh out the address using regex to account for typos etc.
- [ ] **TODO:** Maybe try using unit id's instead of addresses.
- [ ] **TODO:** Drill down to only the top 3-5 locations.
- [ ] **TODO:** Compare predictions between preventing SSO on the most frequents versus not preventing.
- [ ] **TODO:** What is causing the spills on these top 3-5 locations?
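
As a starting point for the address TODO, here is a rough sketch of regex-based normalization. It only handles case, punctuation, and extra whitespace; the function name `normalize_address` and the sample strings are made up for illustration.

```python
import re
import pandas as pd

def normalize_address(addr):
    """Lowercase, drop punctuation, and collapse repeated whitespace."""
    if pd.isna(addr):
        return addr
    cleaned = re.sub(r"[^\w\s]", " ", str(addr).lower())  # punctuation -> space
    return re.sub(r"\s+", " ", cleaned).strip()            # collapse whitespace

# Illustrative variants of the same address
addresses = pd.Series(["6804 S FLORES ST", "6804  S. Flores St ", "215 Audrey Alene Dr"])
normalized = addresses.map(normalize_address)
print(normalized.tolist())
```

This does not reconcile a missing street suffix (for example `6804 S FLORES` versus `6804 S FLORES ST`), which is one reason grouping by unit id may turn out to be more reliable.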

In [None]:
df1.head(4).T

In [None]:
df1[['spill_street_address', 'unit_id_1','unit_id_2', 'unit_type', 'asset_type']].head(15)

- [ ] **TODO:** Maybe we can do some kind of clustering to group problem areas.

In [None]:
df1.unit_id_1.value_counts()[df1.unit_id_1.value_counts() > 7]

In [None]:
df1.unit_id_2.value_counts()[df1.unit_id_2.value_counts() > 7]

In [None]:
df1['root_cause'].value_counts()

In [None]:
df1['spill_street_address'].value_counts()[df1.spill_street_address.value_counts() > 7]

### Looking for locations with most SSOs that are also caused by grease.

In [None]:
df1.columns

In [None]:
# Count how many spill records share each street address and broadcast that count to every row
df1['counts'] = df1.groupby('spill_street_address')['spill_street_address'].transform('count')
df1

### Addresses with seven or more recorded SSOs where the root cause is grease.

In [None]:
df1.loc[(df1['counts'] >= 7) & (df1['root_cause'] == 'grease')]
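
Per-address counts can be broadcast back onto each row by calling `groupby(...).transform('count')` on a single column, which returns a Series aligned with the original index. A toy illustration (the rows and the threshold of 3 are invented for the example):

```python
import pandas as pd

toy = pd.DataFrame({
    "spill_street_address": ["100 pansy ln", "100 pansy ln", "349 alicia", "100 pansy ln"],
    "root_cause": ["grease", "structural", "grease", "grease"],
})

# One count per row: how many records share this row's address
toy["counts"] = (toy.groupby("spill_street_address")["spill_street_address"]
                    .transform("count"))

# Repeat locations whose spill was caused by grease
repeat_grease = toy.loc[(toy["counts"] >= 3) & (toy["root_cause"] == "grease")]
print(repeat_grease)
```

Note that the filter keeps only the grease rows at a repeat address; the count itself still includes spills from every cause at that address.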

### The largest SSOs by volume (more than 1.5 million gallons).

In [None]:
df1[df1.total_gallons > 1500000]

In [None]:
df1.installation_year.value_counts().sort_index()

### Spills by installation year.

In [None]:
plt.figure(figsize=(12,8))
plt.plot(df1[df1.installation_year < 9999].groupby('installation_year')['spill_street_address'].count())

In [None]:
df1.year.value_counts().sort_index()

### Spills by year.

In [None]:
plt.figure(figsize=(12,8))
plt.plot(df1[df1.year < 2019].groupby('year')['spill_street_address'].count())

### All observations grouped by month of the year.

In [None]:
plt.figure(figsize=(12,8))
plt.plot(df1.groupby('month')['spill_street_address'].count())

### Grease-caused spills by month. Colder months appear to bring more grease clogs, since grease solidifies at lower temperatures.

In [None]:
plt.figure(figsize=(12,8))
plt.plot(df1[(df1.root_cause == 'grease') & (df1.year < 2019)].groupby('month')['spill_street_address'].count())