<img src="../images/bikes_banner.jpg" width="1000" />

# <span style="color:#37535e">Bicycle Share Usage</span>

##  <span style='color:#3b748a'>Cleaning Relay Atlanta data</span>

<span style='color:#4095b5'>This notebook loads and cleans 12 months (Sep 2017 - Aug 2018) of data from the Atlanta Relay bicycle share. The data for Sep 2017 is provided in a different format than the other 11 months and required additional cleaning. The data for Aug 2018 was sent to me directly by someone at Relay.</span>

<span style='color:#4095b5'>Each row (observation) describes one rental of one bike. Each rental includes a starting place and time, an ending place and time, as well as distance, duration, user, and bike information. </span>


## <span style='color:#3b748a'>Table of contents</span>
<ul>
    <li><span style='color:#4095b5'>I.  <a href="#checking"><span style='color:#4095b5'>Data checking functions.</span></a></span></li>
    <li><span style='color:#4095b5'>II. <a href="#cleaning"><span style='color:#4095b5'>Data cleaning functions.</span></a></span></li>
    <li><span style='color:#4095b5'>III. <a href="#convert"><span style='color:#4095b5'>Convert from HTML table to CSV for Sept 2017.</span></a></span></li>
    <li><span style='color:#4095b5'>IV. <a href="#import"><span style='color:#4095b5'>Import all data.</span></a></span></li>
    <li><span style='color:#4095b5'>V. <a href="#clean"><span style='color:#4095b5'>Clean all data.</span></a></span></li>
    <li><span style='color:#4095b5'>VI. <a href="#merge"><span style='color:#4095b5'>Merge the dataframes into 1 big one.</span></a></span></li>
    <li><span style='color:#4095b5'>VII. <a href="#explore"><span style='color:#4095b5'>Explore the data.</span></a></span></li>
    <li><span style='color:#4095b5'>VIII. <a href="#write"><span style='color:#4095b5'>Write the full dataframe to a csv file.</span></a></span></li>
    <li><span style='color:#4095b5'>IX. <a href="#store"><span style='color:#4095b5'>For future reference.</span></a></span></li>
        </ul>

## <span style='color:#3b748a'>External data required</span>
<ul>
    <li><span style='color:#4095b5'>../data/atl/trips_&lt;month&gt;.csv for each month in (201709 to 201808); available in GitHub</span></li>
</ul>


## <span style='color:#3b748a'>Links</span>
<ul>
    <li><a href="http://relaybikeshare.com/system-data/"><span style='color:#4095b5'>Atlanta Relay data</span></a></li>
    <li><a href="EDA_ATL.ipynb"><span style='color:#4095b5'>Initial EDA of Atlanta data</span></a> <span style='color:#4095b5'>to understand the data and some potential issues.</span></li>
    <li><a href="plot_atl.ipynb"><span style='color:#4095b5'>Plotting Atlanta data</span></a></li>
</ul>

<hr>

In [1]:
# Let's get the administrative stuff done first
# import all the libraries and set up the plotting

import pandas as pd
import numpy as np
from datetime import datetime,timedelta
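# geopy 2.0 removed vincenty; geodesic is its replacement, aliased here to the old name
try:
    from geopy.distance import vincenty
except ImportError:
    from geopy.distance import geodesic as vincenty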

# Global variables to track the number of rows dropped by each filter
trivial_duration = 0
trivial_distance = 0
outliers_latlon = 0
outliers_duration = 0
outliers_distance = 0

# GnBu_d
colors = ['#37535e', '#3b748a', '#4095b5', '#52aec9', '#72bfc4', '#93d0bf']

<hr>
<a name="checking"> </a>
## <span style='color:#3b748a'>I. Data checking functions</span>

In [2]:
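# For each non-numeric (object) column, print its dtype, unique values, and NaN count;
# for each numeric column, print its dtype, range, and NaN count

def check_cols(df):
    # np.object was removed in NumPy 1.24; the builtin object selects the same columns
    cols = df.select_dtypes(include=[object]).columns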
    for col in cols:
        print("{} is {} and values are {}.".format(col,df[col].dtype,df[col].unique()))
        n_nan = df[col].isnull().sum()
        if n_nan > 0:
            print("{} has {} NaNs".format(col,n_nan))
            
    cols = df.select_dtypes([np.int64,np.float64,np.uint64]).columns
    for col in cols:
        print("{} is {} and values are {} to {}.".format(col,df[col].dtype,df[col].min(),df[col].max()))
        n_nan = df[col].isnull().sum()
        if n_nan > 0:
            print("{} has {} NaNs".format(col,n_nan))
    return

In [3]:
# Report the shape, the total number of NaNs, and which columns contain them

def check_data(df):
    s = df.shape

    # Check for null values
    null_data = df.isnull().sum()
    null_data_count = sum(null_data)
    print("Rows: {}\t Cols: {}\t NaNs: {}".format(s[0],s[1],null_data_count))
    if  null_data_count > 0:
        print("Columns with NaN: {}".format(list(null_data[null_data > 0].index)))

    return
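
<span style='color:#4095b5'>As a quick illustration of what the NaN check reports, here is a minimal sketch on a hypothetical three-row frame. The values are made up; only the column names follow the Relay data.</span>

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the Relay files
toy = pd.DataFrame({
    "Start Hub": ["PIEDMONT & AUBURN", None, "HARDY IVY PARK"],
    "Distance [Miles]": [1.2, np.nan, 3.4],
    "Bike Name": [2766, 2896, 2836],
})

# Per-column NaN counts, computed the same way as inside check_data
null_counts = toy.isnull().sum()
total_nulls = int(null_counts.sum())
cols_with_nulls = list(null_counts[null_counts > 0].index)

print("Rows: {}\t Cols: {}\t NaNs: {}".format(toy.shape[0], toy.shape[1], total_nulls))
print("Columns with NaN: {}".format(cols_with_nulls))
```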

<hr>
<a name="cleaning"></a>
## <span style='color:#3b748a'> II. Data cleaning functions</span>

<span style='color:#4095b5'>These functions clean the rental data.</span>

### <span style='color:#4095b5'>Drop most of the columns.</span>
<span style='color:#52aec9'>As I understand the data more and want to do more modeling, I may choose to drop fewer columns. Many of these are not used yet by Relay and are full of null values.</span>

In [4]:
def drop_columns(df):
    cols_drop = ['User ID', 'Route ID', 'Payment Plan', 'Bike ID', 'Member Type', 'Start Area',
                 'End Area', 'Ride cost', 'Fees', 'Bonuses', 'Total cost','Multiple Rental',
                 'Trip Type','Rental Access Path','Bike Region ID','Start Special Area',
                 'End Special Area']

    # Can't drop a column that isn't there
    cols_drop = list(set(df.columns) & set(cols_drop))
    df.drop(cols_drop, axis=1, inplace=True)

    return df
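
<span style='color:#52aec9'>The set intersection above guards against dropping a column that is not present. A possible alternative, sketched here on hypothetical toy data, is to let pandas skip the missing labels with <code>errors="ignore"</code>. Unlike the in-place version, this returns a new frame.</span>

```python
import pandas as pd

# Hypothetical toy frame with only two of the droppable columns
toy = pd.DataFrame({"User ID": [1, 2], "Start Hub": ["A", "B"], "Fees": [0.0, 1.5]})

# 'Bike Region ID' is not in toy; errors="ignore" skips it instead of raising
cols_drop = ["User ID", "Fees", "Bike Region ID"]
trimmed = toy.drop(columns=cols_drop, errors="ignore")

print(list(trimmed.columns))
```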

### <span style='color:#4095b5'>Rename columns to match Atlanta data names.</span>

In [5]:
def rename_columns(df):
    # ATL: these column names are the reference, so no renaming is necessary
    return df

### <span style='color:#4095b5'>Merge with hub data.</span>
<span style='color:#52aec9'>We may have to use the start/end hubs to get start/end lat/long.</span>

In [6]:
def calc_latlong(df, df_hubs):
    # ATL: all rentals, even those from hubs, have lat/lon
    return df

### <span style='color:#4095b5'>Drop rows with nulls.</span>
<span style='color:#52aec9'>No rows need to be dropped right now. Replace NaN in the Hub columns with 'NONE'.</span>

In [7]:
def drop_nans(df):
    df['Start Hub'] = df['Start Hub'].fillna('NONE')
    df['End Hub'] = df['End Hub'].fillna('NONE')

    return df

### <span style='color:#4095b5'>Use appropriate datatypes.</span>
<span style='color:#52aec9'>Fix Date/Time objects and cast Latitude and Longitude to floats.</span>

In [8]:
def clean_datatypes(df):
    df['Start Latitude'] = df['Start Latitude'].astype(float)
    df['Start Longitude'] = df['Start Longitude'].astype(float)
    df['End Latitude'] = df['End Latitude'].astype(float)
    df['End Longitude'] = df['End Longitude'].astype(float)

    # Combine the date and time strings into full datetimes
    # (must run before the date columns are converted below)
    df['Start Time'] = pd.to_datetime(df['Start Date'] + ' ' + df['Start Time'])
    df['End Time'] = pd.to_datetime(df['End Date'] + ' ' + df['End Time'])

    # Turn dates into datetimes
    df['Start Date'] = pd.to_datetime(df['Start Date'])
    df['End Date'] = pd.to_datetime(df['End Date'])

    # Fix the durations
    df['Duration'] = pd.to_timedelta(df['Duration'])
    
    return df
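
<span style='color:#52aec9'>A minimal sketch of the conversions above on one hypothetical row. The string formats shown ('YYYY-MM-DD', 'HH:MM', 'HH:MM:SS') are assumptions for illustration; the raw Relay files may format these fields differently.</span>

```python
import pandas as pd

# Hypothetical single-row frame; string formats are assumed
toy = pd.DataFrame({
    "Start Date": ["2017-09-04"],
    "Start Time": ["19:27"],
    "Duration": ["01:00:00"],
})

# Date string + time string -> one datetime per row
start = pd.to_datetime(toy["Start Date"] + " " + toy["Start Time"])

# 'HH:MM:SS' string -> timedelta, which supports arithmetic with datetimes
duration = pd.to_timedelta(toy["Duration"])
end = start + duration

print(start.iloc[0], duration.iloc[0], end.iloc[0])
```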

### <span style='color:#4095b5'>Calculate distances.</span>
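<span style='color:#52aec9'>Not used for Atlanta data, which already includes a distance. <br />The straight-line distance between the start and end points is a poor approximation of the distance ridden. If a bike was checked out and returned to the same station, the calculated distance will be trivial.</span>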

In [9]:
def distance_calc(row):
    start = (row['Start Latitude'], row['Start Longitude'])
    stop = (row['End Latitude'], row['End Longitude'])

    return vincenty(start, stop).miles

In [10]:
def calc_distances(df):
    # ATL: No distance calculation necessary
    return df
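
<span style='color:#52aec9'>To see why the endpoint distance is a poor approximation, here is a sketch that uses the haversine (great-circle) formula rather than geopy's ellipsoidal calculation; the two differ only slightly at city scale. The function name and the Earth radius constant are my own choices for illustration. The coordinates come from the first row of the <code>df.head()</code> output in this notebook, a rental whose reported distance is 4.5 miles.</span>

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an approximation

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

# Start and end are almost the same place, so the endpoint distance is
# tiny even though the rental reports 4.5 miles ridden
d_round_trip = haversine_miles(33.738905, -84.431592, 33.73885, -84.43179)
print("Endpoint distance: {:.3f} miles".format(d_round_trip))
```

<span style='color:#52aec9'>An endpoint distance this small would fall under the 0.02 mile trivial-trip threshold, so a round trip would be dropped if this calculation replaced the reported distance.</span>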

### <span style='color:#4095b5'>Reorder columns.</span>
<span style='color:#52aec9'>Make order same as Atlanta data.</span>

In [11]:
def reorder_cols(df):
    # ATL: this column order is the reference, so no reordering is necessary
    return df

### <span style='color:#4095b5'>Drop trivial trips.</span>
<span style='color:#52aec9'>Trivial trips cover 0.02 miles or less, or last less than 3 minutes.</span>

In [12]:
def drop_trivial_trips_distance(df):
    df = df[df["Distance [Miles]"] > 0.02].copy()
    return df

In [13]:
def drop_trivial_trips_duration(df):
    df = df[df["Duration"] >= pd.to_timedelta('00:03:00')].copy()
    return df

In [14]:
def drop_trivial_trips(df):
    global trivial_duration
    global trivial_distance

    rows = df.shape[0]
    df = drop_trivial_trips_duration(df)
    rows_duration = df.shape[0]
    trivial_duration += rows-rows_duration

    df = drop_trivial_trips_distance(df)
    rows_distance = df.shape[0]
    trivial_distance += rows_duration-rows_distance

    return df
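
<span style='color:#52aec9'>A small sketch of how the two thresholds interact, on hypothetical toy rows. Because the duration filter runs first, a row that fails both tests would be counted as a trivial duration, not a trivial distance; the masks below reproduce that ordering.</span>

```python
import pandas as pd

# Hypothetical toy rentals chosen to sit on either side of each threshold
toy = pd.DataFrame({
    "Distance [Miles]": [0.01, 1.50, 2.00, 0.02],
    "Duration": pd.to_timedelta(["00:10:00", "00:02:59", "00:03:00", "00:30:00"]),
})

long_enough = toy["Duration"] >= pd.Timedelta(minutes=3)   # 3 minutes is kept
far_enough = toy["Distance [Miles]"] > 0.02                # exactly 0.02 is dropped

kept = toy[long_enough & far_enough]
n_trivial_duration = int((~long_enough).sum())
n_trivial_distance = int((long_enough & ~far_enough).sum())

print(len(kept), n_trivial_duration, n_trivial_distance)
```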

### <span style='color:#4095b5'>Drop outliers.</span>
<ul>
    <li><span style='color:#52aec9'>Some of the ATL bikes are reported to be in Athens or in NYC.</span></li> 
    <li><span style='color:#52aec9'>One ATL bike was out for 36 days; 174 rentals were for longer than 24 hours.</span></li> 
    <li><span style='color:#52aec9'>One ATL bike went 3913.01 miles.</span></li> 
</ul>

In [15]:
def drop_outliers_latlon(df):
    df = df[df["Start Latitude"] < 33.9].copy()
    df = df[df["End Latitude"] < 33.9].copy()
    df = df[df["Start Latitude"] > 33.5].copy()
    df = df[df["End Latitude"] > 33.5].copy()

    df = df[df["Start Longitude"] < -83.0].copy()
    df = df[df["End Longitude"] < -83.0].copy()

    return df
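
<span style='color:#52aec9'>The same bounding box can be written as a single boolean mask. This is a sketch on hypothetical toy rows (one plausible Atlanta trip, one starting near Athens, one ending near New York); the constant names are mine, and the bounds are the ones used above, including the absence of a western longitude limit.</span>

```python
import pandas as pd

LAT_MIN, LAT_MAX = 33.5, 33.9   # same latitude bounds as drop_outliers_latlon
LON_MAX = -83.0                 # same eastern limit; no western limit is applied

# Hypothetical rows: Atlanta, start near Athens GA, end near New York City
toy = pd.DataFrame({
    "Start Latitude": [33.77, 33.95, 33.76],
    "End Latitude": [33.78, 33.77, 40.71],
    "Start Longitude": [-84.38, -83.38, -84.39],
    "End Longitude": [-84.37, -84.38, -74.01],
})

in_box = pd.Series(True, index=toy.index)
for col in ["Start Latitude", "End Latitude"]:
    in_box &= (toy[col] > LAT_MIN) & (toy[col] < LAT_MAX)
for col in ["Start Longitude", "End Longitude"]:
    in_box &= toy[col] < LON_MAX

inside = toy[in_box].copy()
print("{} of {} rows are inside the box".format(len(inside), len(toy)))
```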

In [16]:
def drop_outliers_duration(df):
    df = df[df["Duration"] <= pd.to_timedelta('24:00:00')].copy()
    return df

In [17]:
def drop_outliers_distance(df):
    df_temp = df[df["Distance [Miles]"] >= 100.0]
    if df_temp.shape[0]:
        print("Long trip: ", df_temp[['Start Latitude','Start Longitude', 'Start Time', 
                                     'End Latitude', 'End Longitude', 'End Time', 
                                     'Distance [Miles]', 'Duration']])
    df = df[df["Distance [Miles]"] < 100.0].copy()
    return df

In [18]:
def drop_outliers(df):
    global outliers_latlon
    global outliers_duration
    global outliers_distance
    
    rows = df.shape[0]
    df = drop_outliers_latlon(df)
    rows_latlon = df.shape[0]
    outliers_latlon += rows - rows_latlon
    
    df = drop_outliers_duration(df)
    rows_duration = df.shape[0]
    outliers_duration += rows_latlon - rows_duration
    
    df = drop_outliers_distance(df)
    rows_distance = df.shape[0]
    outliers_distance += rows_duration - rows_distance
    
    return df

### <span style='color:#4095b5'>Pull all of the cleaning together.</span>

In [19]:
def clean_df(df, df_hubs=None):
    global trivial_duration
    global trivial_distance
    global outliers_latlon
    global outliers_duration
    global outliers_distance

    df = drop_columns(df)
    df = rename_columns(df)
    df = calc_latlong(df, df_hubs)
    df = drop_nans(df)
    df = clean_datatypes(df)
    df = calc_distances(df)
    df = reorder_cols(df)
    df = drop_trivial_trips(df)
    df = drop_outliers(df)

    # Rows dropped so far. The counters are global, so these totals are
    # cumulative across every DataFrame cleaned in this session
    print("Trivial dur: {} dist: {}".format(trivial_duration, trivial_distance))
    print("Outlier loc: {} dur: {} dist: {}".format(outliers_latlon,
                                                     outliers_duration,
                                                     outliers_distance))
    return df
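
<span style='color:#4095b5'>Because the counters are global, the printed totals accumulate from one month to the next. If per-month counts are wanted instead, one option is to return the counts from the cleaning step. The helper below is a hypothetical sketch of that idea, not part of this notebook; it runs on toy data with the same distance thresholds.</span>

```python
import pandas as pd

def apply_and_count(df, steps):
    """Run each (label, filter_function) in order and record the rows it removed."""
    dropped = {}
    for label, step in steps:
        before = len(df)
        df = step(df)
        dropped[label] = before - len(df)
    return df, dropped

# Hypothetical toy data: one trivial trip, one distance outlier, two normal trips
toy = pd.DataFrame({"Distance [Miles]": [0.01, 1.5, 250.0, 2.0]})

steps = [
    ("trivial_distance", lambda d: d[d["Distance [Miles]"] > 0.02].copy()),
    ("outliers_distance", lambda d: d[d["Distance [Miles]"] < 100.0].copy()),
]

cleaned, dropped = apply_and_count(toy, steps)
print(dropped)
```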

<hr>
<a name="convert"></a>
## <span style='color:#3b748a'> III. Convert Sept 2017 Relay Atlanta data from HTML table to CSV.</span>
<span style='color:#4095b5'>Imported html file into a spreadsheet.</span>
<ul>
    <li> <span style='color:#4095b5'>Removed index and header in csv.</span></li>
    <li> <span style='color:#4095b5'>Cleaned a number of rows with '-' for Start Lat/Long. Where a Start Hub was given, used the hub's Lat/Long; otherwise, deleted the row.</span></li>
    <li> <span style='color:#4095b5'>Cleaned a number of rows with '-' for End Lat/Long. Where an End Hub was given, used the hub's Lat/Long; otherwise, deleted the row.</span></li>
    <li> <span style='color:#4095b5'>Removed one outlier row.</span></li>
</ul>    

<span style='color:#4095b5'>In this notebook, recalculate 'Duration' from start and end times.</span>


In [20]:
# Calculate the Durations for Sept 2017 and write the DataFrame to a csv file
if False:
    df_temp = pd.read_csv("../data/atl/trips_201709-full.csv")
    temp_start = pd.to_datetime(df_temp['Start Date'] + ' ' + df_temp['Start Time'])
    temp_end = pd.to_datetime(df_temp['End Date'] + ' ' + df_temp['End Time'])
    df_temp['Duration'] = pd.to_timedelta(temp_end - temp_start)
    df_temp.to_csv('../data/atl/trips_201709.csv', index=False)
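
<span style='color:#4095b5'>A minimal sketch of the duration recalculation on one hypothetical rental that crosses midnight, followed by a CSV round trip through an in-memory buffer. The date and time string formats are assumptions. Subtracting full datetimes (not just times of day) is what keeps the overnight case correct.</span>

```python
import io
import pandas as pd

# Hypothetical rental that starts before midnight and ends after it
toy = pd.DataFrame({
    "Start Date": ["2017-09-30"], "Start Time": ["23:50"],
    "End Date": ["2017-10-01"], "End Time": ["00:20"],
})

start = pd.to_datetime(toy["Start Date"] + " " + toy["Start Time"])
end = pd.to_datetime(toy["End Date"] + " " + toy["End Time"])
toy["Duration"] = end - start

# Write to CSV and read back; Duration comes back as a string
# that pd.to_timedelta can parse again
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
reloaded["Duration"] = pd.to_timedelta(reloaded["Duration"])

print(reloaded["Duration"].iloc[0])
```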

<hr>
<a name="import"></a>
## <span style='color:#3b748a'> IV. Import all data from Atlanta</span>


In [21]:
# Atlanta data divided up by month
trip_data = ['201709',
             '201710', '201711', '201712',
             '201801', '201802', '201803',
             '201804', '201805', '201806',
             '201807', '201808']

In [22]:
# Dictionary of DataFrames, one for each month
df_data = dict()
for d in trip_data:
    df_data[d] = pd.read_csv("../data/atl/trips_"+str(d)+".csv")

<hr>
<a name="clean"></a>

## <span style='color:#3b748a'>V. Clean all data from Atlanta.</span>
<ul>
    <li><span style='color:#4095b5'>For now, drop most of the columns.</span></li>
    <li><span style='color:#4095b5'>Drop the trivial trips.</span></li>
    <li><span style='color:#4095b5'>Drop the outliers.</span></li>
    <li><span style='color:#4095b5'>Use appropriate column types.</span></li>
    <li><span style='color:#4095b5'>Clean before merge because of different input data.</span></li>
</ul>

In [23]:
# For each month, clean the DataFrame
print("Cleaning the data:")
for d in trip_data:
    print("Month: {} \nRows: {}\t Cols: {}\t NaNs: {}".format(d, 
                                                    df_data[d].shape[0], 
                                                    df_data[d].shape[1], 
                                                    sum(df_data[d].isnull().sum())))
    df_data[d] = clean_df(df_data[d])
    check_data(df_data[d])

Cleaning the data:
Month: 201709 
Rows: 14476	 Cols: 27	 NaNs: 19227
Trivial dur: 920 dist: 123
Outlier loc: 0 dur: 10 dist: 0
Rows: 13423	 Cols: 13	 NaNs: 0
Month: 201710 
Rows: 10446	 Cols: 27	 NaNs: 13773
Trivial dur: 1654 dist: 172
Outlier loc: 1 dur: 19 dist: 0
Rows: 9653	 Cols: 13	 NaNs: 0
Month: 201711 
Rows: 8672	 Cols: 27	 NaNs: 11485
Trivial dur: 2289 dist: 219
Outlier loc: 1 dur: 23 dist: 0
Rows: 7986	 Cols: 13	 NaNs: 0
Month: 201712 
Rows: 4348	 Cols: 27	 NaNs: 6001
Trivial dur: 2663 dist: 252
Outlier loc: 4 dur: 25 dist: 0
Rows: 3936	 Cols: 13	 NaNs: 0
Month: 201801 
Rows: 3933	 Cols: 30	 NaNs: 17363
Trivial dur: 3073 dist: 304
Outlier loc: 5 dur: 29 dist: 0
Rows: 3466	 Cols: 13	 NaNs: 0
Month: 201802 
Rows: 7008	 Cols: 30	 NaNs: 30612
Trivial dur: 3560 dist: 347
Outlier loc: 5 dur: 46 dist: 0
Rows: 6461	 Cols: 13	 NaNs: 0
Month: 201803 
Rows: 8815	 Cols: 30	 NaNs: 39045
Trivial dur: 4318 dist: 470
Outlier loc: 10 dur: 63 dist: 0
Rows: 7912	 Cols: 13	 NaNs: 0
Month: 201804

<hr>
<a name="merge"></a>

## <span style='color:#3b748a'> VI. Merge the DataFrames into 1 big DataFrame</span>

In [24]:
# DataFrame.append was removed in pandas 2.0, so combine the months with pd.concat
n_rows = sum(df_data[d].shape[0] for d in trip_data)
df = pd.concat([df_data[d] for d in trip_data])

if n_rows != df.shape[0]:
    print("There is a problem with the DataFrame merge!")

<hr>
<a name="explore"></a>

## <span style='color:#3b748a'> VII. Explore the data.</span>

In [25]:
df.head()    

| | Start Hub | Start Latitude | Start Longitude | Start Date | Start Time | End Hub | End Latitude | End Longitude | End Date | End Time | Bike Name | Distance [Miles] | Duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15 | NONE | 33.738905 | -84.431592 | 2017-09-04 | 2017-09-04 19:27:00 | GORDON-WHITE PARK | 33.73885 | -84.43179 | 2017-09-04 | 2017-09-04 20:27:00 | 2766 | 4.5 | 01:00:00 |
| 16 | PIEDMONT & AUBURN | 33.755733 | -84.382117 | 2017-09-16 | 2017-09-16 09:33:00 | PIEDMONT & AUBURN | 33.755738 | -84.382107 | 2017-09-16 | 2017-09-16 10:33:00 | 2896 | 5.97 | 01:00:00 |
| 17 | NONE | 33.762698 | -84.387455 | 2017-09-09 | 2017-09-09 16:01:00 | HARDY IVY PARK | 33.762682 | -84.387357 | 2017-09-09 | 2017-09-09 17:01:00 | 2836 | 5.02 | 01:00:00 |
| 18 | NONE | 33.786287 | -84.371913 | 2017-09-02 | 2017-09-02 14:35:00 | PIEDMONT PARK EAST | 33.786298 | -84.371453 | 2017-09-02 | 2017-09-02 15:35:00 | 2564 | 4.13 | 01:00:00 |
| 19 | IRWIN & EASTSIDE BELTLINE | 33.757585 | -84.364665 | 2017-09-21 | 2017-09-21 18:22:00 | IRWIN & EASTSIDE BELTLINE | 33.757607 | -84.364697 | 2017-09-21 | 2017-09-21 19:22:00 | 2917 | 2.49 | 01:00:00 |


In [26]:
df.shape

(103706, 13)

In [27]:
df.columns

Index(['Start Hub', 'Start Latitude', 'Start Longitude', 'Start Date',
       'Start Time', 'End Hub', 'End Latitude', 'End Longitude', 'End Date',
       'End Time', 'Bike Name', 'Distance [Miles]', 'Duration'],
      dtype='object')

In [28]:
df.describe()

| | Start Latitude | Start Longitude | End Latitude | End Longitude | Distance [Miles] | Duration |
|---|---|---|---|---|---|---|
| count | 103706.0 | 103706.0 | 103706.0 | 103706.0 | 103706.0 | 103706 |
| mean | 33.770592 | -84.377435 | 33.770728 | -84.376944 | 2.335644 | 0 days 00:35:07.468873 |
| std | 0.014321 | 0.016241 | 0.014259 | 0.016214 | 1.957469 | 0 days 00:49:41.908833 |
| min | 33.657687 | -84.574992 | 33.657703 | -84.504577 | 0.03 | 0 days 00:03:00 |
| 25% | 33.759437 | -84.388063 | 33.759612 | -84.387413 | 1.02 | 0 days 00:10:51 |
| 50% | 33.772493 | -84.372618 | 33.772515 | -84.371422 | 1.83 | 0 days 00:23:20 |
| 75% | 33.781913 | -84.365252 | 33.781812 | -84.365203 | 3.09 | 0 days 00:45:41.750000 |
| max | 33.878115 | -84.160037 | 33.880417 | -84.159963 | 93.23 | 0 days 23:59:44 |


In [29]:
if False:
    check_data(df)
    check_cols(df)

In [30]:
# Check dates (Sep 2017 - Aug 2018)
print("Min start date: {}".format(df['Start Date'].min()))
print("Min end date: {}".format(df['End Date'].min()))
print("Max start date: {}".format(df['Start Date'].max()))
print("Max end date: {}".format(df['End Date'].max()))
print("Number of days: {}".format(len(set(df['Start Date']))))

Min start date: 2017-09-01 00:00:00
Min end date: 2017-09-01 00:00:00
Max start date: 2018-08-31 00:00:00
Max end date: 2018-09-01 00:00:00
Number of days: 365
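
<span style='color:#4095b5'>Counting 365 distinct start dates confirms that every day in the period has at least one rental. If the count had come up short, comparing against a full calendar would show which days are missing. A sketch on hypothetical dates:</span>

```python
import pandas as pd

# Hypothetical start dates with 3 Sep 2017 missing
start_dates = pd.Series(pd.to_datetime(
    ["2017-09-01", "2017-09-01", "2017-09-02", "2017-09-04"]))

# Every calendar day the data is expected to cover
expected = pd.date_range("2017-09-01", "2017-09-04", freq="D")
missing = expected.difference(pd.DatetimeIndex(start_dates.unique()))
print("Missing days: {}".format(list(missing)))

# The full period in this notebook spans 365 calendar days
full_period = pd.date_range("2017-09-01", "2018-08-31", freq="D")
print("Days in period: {}".format(len(full_period)))
```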


#### <span style='color:#4095b5'>Fewest rentals</span>
<li><span style='color:#4095b5'>11 sep 2017 - Hurricane Irma 3.5 inches of rain</span></li>

#### <span style='color:#4095b5'>Most rentals</span>
<li><span style='color:#4095b5'>1 oct 2017 - Lovely fall Sunday?</span></li>

#### <span style='color:#4095b5'>Outliers on Upper side of Total rentals per day of week</span>
<li><span style='color:#4095b5'>4 sep 2017 - Labor Day</span></li>
<li><span style='color:#4095b5'>4 jul 2018</span></li>

#### <span style='color:#4095b5'>Outliers on Upper side of Total or Avg Duration per day of week</span>
<li><span style='color:#4095b5'>4 sep 2017 - Labor Day</span></li>
<li><span style='color:#4095b5'>31 mar 2018 - Spring Food and Wine Crawl?</span></li>
<li><span style='color:#4095b5'>2 apr 2018 - Easter Monday</span></li>
<li><span style='color:#4095b5'>5 apr 2018 - Five rentals of over 20 hours</span></li>
<li><span style='color:#4095b5'>1 may 2018 - One long rental, not really an outlier</span></li>
<li><span style='color:#4095b5'>8 june 2018 - 2018 Atlanta Moon Ride (https://events.accessatlanta.com/event/2018-atlanta-moon-ridejune-8-20185a7a9840ddecc)</span></li>
<li><span style='color:#4095b5'>4 jul 2018</span></li>

#### <span style='color:#4095b5'>Outliers on Lower side of Total or Avg Duration per day of week</span>
<li><span style='color:#4095b5'>8 dec 2017 - Snow</span></li>
<li><span style='color:#4095b5'>23 dec 2017 - Rainy saturday, close to Christmas</span></li>
<li><span style='color:#4095b5'>12 jan 2018 - Cold and rainy</span></li>

#### <span style='color:#4095b5'>Outliers on Upper side Total or Avg Distance per day of week</span>
<li><span style='color:#4095b5'>26 dec 2017 - Post Christmas workouts?</span></li>
<li><span style='color:#4095b5'>8 jun 2018 - 2018 Atlanta Moon Ride</span></li>

#### <span style='color:#4095b5'>Outliers on Lower side Total or Avg Distance per day of week</span>
<li><span style='color:#4095b5'>8 dec 2017 - Snow</span></li>
<li><span style='color:#4095b5'>9 dec 2017 - Snow</span></li>
<li><span style='color:#4095b5'>15 dec 2017 - Cold</span></li>
<li><span style='color:#4095b5'>23 dec 2017 - Rainy saturday, close to Christmas</span></li>
<li><span style='color:#4095b5'>12 jan 2018 - Cold and rainy</span></li>
<li><span style='color:#4095b5'>10 feb 2018 - Rainy</span></li>
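
<span style='color:#4095b5'>This notebook does not show how the outlier days above were identified. One plausible approach, sketched here under my own assumptions, is the boxplot rule: flag a day when its value lies more than 1.5 × IQR outside the quartiles for its day of week. The function name and the toy data are hypothetical. For the real data, the input could be a daily series such as <code>df.groupby('Start Date').size()</code>.</span>

```python
import pandas as pd

def flag_daily_outliers(daily, k=1.5):
    """Return days whose value is outside k * IQR of the quartiles for their weekday.

    daily: Series of one value per day, indexed by a DatetimeIndex.
    """
    frame = daily.to_frame("value")
    frame["weekday"] = frame.index.dayofweek  # Monday=0 ... Sunday=6
    grouped = frame.groupby("weekday")["value"]
    # Quartiles per weekday, mapped back onto each day
    q1 = frame["weekday"].map(grouped.quantile(0.25))
    q3 = frame["weekday"].map(grouped.quantile(0.75))
    iqr = q3 - q1
    mask = (frame["value"] < q1 - k * iqr) | (frame["value"] > q3 + k * iqr)
    return frame[mask]

# Hypothetical daily rental counts: 8 weeks at a flat 100 per day, with one spike
days = pd.date_range("2018-01-01", periods=56, freq="D")
daily = pd.Series(100, index=days)
daily.loc[pd.Timestamp("2018-01-15")] = 400   # a Monday

outliers = flag_daily_outliers(daily)
print(outliers)
```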


<hr>
<a name="write"></a>

## <span style='color:#3b748a'>VIII. Write the full DataFrame to a csv file.</span>

In [39]:
df.to_csv('../data/atl/trips_all.csv', index=False)

<hr>
<a name="store"></a>

## <span style='color:#3b748a'>IX. For future reference.</span>

In [None]:
# df.groupby('Bike Name').sum()[['Distance [Miles]']].sort_values('Distance [Miles]')
# df.groupby('Start Date').count()[['Duration']]
# df.groupby(['Bike Name']).sum().sort_values(by='Distance [Miles]', ascending=True).tail()

In [None]:
# outliers = pd.DataFrame()
# outliers_distance = pd.DataFrame()
# outliers_duration = pd.DataFrame()

# trip_data = ['201709',
#              '201710', '201711', '201712',
#              '201801', '201802', '201803',
#              '201804', '201805', '201806',
#              '201807']

# d = '201807'
# print("Cleaning the data:")
# print("Month: {} \nRows: {}\t Cols: {}\t NaNs: {}".format(d, 
#                                                     df_data[d].shape[0], 
#                                                     df_data[d].shape[1], 
#                                                     sum(df_data[d].isnull().sum())))
# df_temp = df_data[d].copy()
# print(df_temp.shape)
# df_temp = drop_columns(df_temp)
# print(df_temp.shape)

# df_temp.columns

# df_temp = rename_columns(df_temp)
# df_temp.columns

# df_temp = calc_latlong(df_temp, None)
# print(df_temp.shape)
# print(sum(df_temp.isnull().sum()))
# df_temp = drop_nans(df_temp)
# print(df_temp.shape)
# print(sum(df_temp.isnull().sum()))

# df_temp = clean_datatypes(df_temp)
# df_temp = calc_distances(df_temp)
# df_temp = reorder_cols(df_temp)

# print(df_temp.shape)
# df_temp = drop_trivial_trips(df_temp)
# print(df_temp.shape)

# df2 = df_temp[df_temp["Start Latitude"] >= 33.9].copy()
# outliers = outliers.append(df2)
# df3 = df_temp[df_temp["End Latitude"] >= 33.9].copy()
# outliers = outliers.append(df3)
# df4 = df_temp[df_temp["Start Latitude"] < 33.5].copy()
# outliers = outliers.append(df4)
# df5 = df_temp[df_temp["End Latitude"] < 33.5].copy()
# outliers = outliers.append(df5)
# df6 = df_temp[df_temp["Start Longitude"] >= -83.0].copy()
# outliers = outliers.append(df6)
# df7 = df_temp[df_temp["End Longitude"] >= -83.0].copy()
# outliers = outliers.append(df7)

# df8 = df_temp[df_temp["Duration"] > pd.to_timedelta('24:00:00')].copy()
# outliers_duration = outliers_duration.append(df8)
# df9 = df_temp[df_temp["Distance [Miles]"] >= 100.0].copy()
# outliers_distance = outliers_distance.append(df9)

# print(df_temp.shape)
# df_temp = drop_outliers(df_temp)
# print(df_temp.shape)

# outliers.shape

# outliers_distance.shape

# outliers_duration.shape

# outliers.sort_values(by="Start Latitude")



# outliers.to_csv('../data/atl/outliers.csv', index=False)
# outliers_duration.to_csv('../data/atl/outliers_duration.csv', index=False)
# outliers_distance.to_csv('../data/atl/outliers_distance.csv', index=False)

# outliers_distance.sort_values(by='Distance [Miles]')