# Table of Contents
1. [Step 1: Gather the Data](#step1)
1. [Step 2: Prepare Data for Consumption](#step2)
   - [2.1 Import Libraries](#step2.1)
   - [2.2 Meet The Data](#step2.2)
   - [2.3 Clean Data](#step2.3)
   - [2.4 Convert Formats](#step2.4)

<a id="step1"></a>
# Step 1: Gather the Data
The data was collected and assembled in the `create_ds.ipynb` notebook.

<a id="step2"></a>
# Step 2: Prepare Data for Consumption

<a id="step2.1"></a>
## 2.1 Import Libraries

In [46]:
import sys
print("Python version: {}".format(sys.version))

import pandas as pd
print("pandas version: {}".format(pd.__version__))

import numpy as np
print("NumPy version: {}".format(np.__version__))

import scipy as sp
print("SciPy version: {}".format(sp.__version__))

import matplotlib
import matplotlib.pyplot as plt
print("matplotlib version: {}".format(matplotlib.__version__))

import seaborn as sns
print("seaborn version: {}".format(sns.__version__))
%matplotlib inline

# Show every column when displaying a DataFrame
pd.set_option('display.max_columns', None)
sns.set(style='white', context='notebook', palette='deep')

Python version: 3.7.4 (tags/v3.7.4:e09359112e, Jul  8 2019, 19:29:22) [MSC v.1916 32 bit (Intel)]
pandas version: 1.0.3
NumPy version: 1.18.2
SciPy version: 1.4.1
matplotlib version: 3.2.1
seaborn version: 0.10.0


<a id="step2.2"></a>
## 2.2 Meet The Data

Data explanation should be here

In [47]:
df = pd.read_csv("./input/autotel.csv", dtype={
    "Order number": "str",
    "Year": "str",
    "Month": "str",
    "Day": "str",
    "Hour": "str",
    "Minute": "str",
}, index_col=0)
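
A small sketch of what the explicit `dtype` mapping changes: columns read as `str` keep their original text (including any leading zeros), while inferred dtypes turn the same values into integers. The inline CSV and its values are made up for illustration and are not taken from `autotel.csv`.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for ./input/autotel.csv
csv_text = "idx,Order number,Month,Day\n0,00123,03,07\n1,00456,11,21\n"

# Explicit string dtypes, as in the cell above
as_str = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"Order number": "str", "Month": "str", "Day": "str"},
    index_col=0,
)

# Let pandas infer the dtypes instead
inferred = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(as_str["Month"].tolist())    # ['03', '11']
print(inferred["Month"].tolist())  # [3, 11]
```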



In [48]:
# print(df.isnull().sum())
# df.describe(include='all')

<a id="step2.3"></a>
## 2.3 Clean Data

In [49]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 285280 entries, 0 to 300003
Data columns (total 35 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   Car id                    285280 non-null  object 
 1   Order number              285280 non-null  object 
 2   Category                  285280 non-null  object 
 3   Avg distance              285280 non-null  int64  
 4   Billing minutes           269948 non-null  object 
 5   Address                   285280 non-null  object 
 6   Time                      284105 non-null  object 
 7   kmh                       266294 non-null  float64
 8   Coords                    270496 non-null  object 
 9   Lat                       270496 non-null  float64
 10  Lon                       270496 non-null  float64
 11  neighborhood              270496 non-null  object 
 12  Area                      270496 non-null  float64
 13  Population                270496 non-null  f

### 2.3. Filling empty billing minutes with 0 (in most cases the drive was canceled)

In [50]:
df['Billing minutes'] = pd.to_numeric(df['Billing minutes'],errors='coerce')
df['Billing minutes'] = df['Billing minutes'].fillna(0)
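
To illustrate the two steps above: `errors='coerce'` turns anything that cannot be parsed as a number into `NaN`, and `fillna(0)` then replaces those gaps with zero. The sample values are invented.

```python
import pandas as pd

# Hypothetical raw values: numeric strings, a non-numeric string, a missing value
raw = pd.Series(["12", "7.5", "n/a", None], dtype=object)

# Unparseable or missing entries become NaN, then NaN becomes 0
minutes = pd.to_numeric(raw, errors="coerce").fillna(0)

print(minutes.tolist())  # [12.0, 7.5, 0.0, 0.0]
```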

### 2.3. Delete empty addresses

In [51]:
delete_empty_addresses = df["Address"].isnull()

In [52]:
df = df[~delete_empty_addresses]
# These alternatives work too:
# df.drop(df.index[delete_empty_addresses], inplace=True)
# df = df[pd.notnull(df['Address'])]

In [53]:
df["Address"].isnull().sum()

0
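
A minimal sketch, on a made-up frame, showing that filtering with the inverted mask, `dropna(subset=...)`, and dropping by the masked index labels all remove the same rows. The addresses are placeholders.

```python
import pandas as pd

# Hypothetical frame with one missing address
toy = pd.DataFrame({
    "Address": ["A St 1", None, "B St 2"],
    "kmh": [20.0, 15.0, None],
})

missing = toy["Address"].isnull()

by_mask = toy[~missing]                      # keep rows where Address is present
by_dropna = toy.dropna(subset=["Address"])   # same result via dropna
by_drop = toy.drop(toy.index[missing])       # same result via index labels

print(by_mask.index.tolist())  # [0, 2]
```

Note that only `Address` is checked, so the row with a missing `kmh` is kept.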

### 2.3. Fill NA Street_c with Street
The missing values occur only for streets whose names consist solely of digits.

In [54]:
df['Street_c'] = df['Street_c'].fillna(df['Street'])
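
As a sketch of how this fill behaves: passing a Series to `fillna` fills each gap with the value at the same index label, so only the missing `Street_c` entries take the `Street` value. The street names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the cleaned name is missing only for the all-digit street
toy = pd.DataFrame({
    "Street": ["Herzl", "1234", "Allenby"],
    "Street_c": ["herzl", np.nan, "allenby"],
})

# Assigning the result back avoids relying on in-place modification of a column
toy["Street_c"] = toy["Street_c"].fillna(toy["Street"])

print(toy["Street_c"].tolist())  # ['herzl', '1234', 'allenby']
```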

### 2.3. Delete canceled and NO SHOW category rows

In [55]:
df = df[(df['Category'] != 'Canceled') & (df['Category'] != 'NO SHOW')]
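
An equivalent way to write this filter, sketched with `isin`, which may read more easily if more categories need to be excluded later. The `Regular` category is a placeholder, not a value confirmed from the dataset.

```python
import pandas as pd

# Hypothetical frame; 'Regular' is an assumed category name
toy = pd.DataFrame({
    "Category": ["Regular", "Canceled", "NO SHOW", "Regular"],
    "Billing minutes": [12, 0, 0, 30],
})

excluded = ["Canceled", "NO SHOW"]
kept = toy[~toy["Category"].isin(excluded)]

print(kept.index.tolist())  # [0, 3]
```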

### 2.3. Delete empty kmh drives
In these rows the start and destination addresses are the same, so the drive did not actually take place.

In [56]:
df = df[(df['Avg distance'] != 0) & (df['Billing minutes'] != 0)]

### 2.3. Filling NA kmh
Formula: `kmh = Avg distance / (Billing minutes / 60)`

In [57]:
# df['kmh'] = df.apply(lambda row : row['Avg distance']/(row['Billing minutes']/60) if np.isnan(row['kmh']) else row['kmh'],axis=1)

In [58]:
# Speed = distance / time in hours, matching the formula above
df['kmh'] = df['kmh'].fillna(df['Avg distance'] / (df['Billing minutes'] / 60))
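
A worked example of the formula stated above, distance divided by the billing time in hours. It assumes `Avg distance` is in kilometres, which the notebook does not state; the numbers are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical drives; distances are assumed to be in kilometres
toy = pd.DataFrame({
    "Avg distance": [10, 6, 3],
    "Billing minutes": [30.0, 12.0, 15.0],
    "kmh": [np.nan, 25.0, np.nan],
})

# 10 km in 30 min -> 10 / 0.5 = 20 km/h; 3 km in 15 min -> 3 / 0.25 = 12 km/h
estimated = toy["Avg distance"] / (toy["Billing minutes"] / 60)

# Only missing speeds are replaced; the recorded 25.0 is left untouched
toy["kmh"] = toy["kmh"].fillna(estimated)

print(toy["kmh"].tolist())  # [20.0, 25.0, 12.0]
```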

### 2.3. Filling empty Lat, Lon and Coords from rows with the same Address

In [59]:
AddressCols = ['Address','Lat','Lon','neighborhood','Coords']
df[AddressCols] = df[AddressCols].sort_values(['Address']).ffill()
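
One caveat with sorting and forward-filling: if the first row of an address has no coordinates, it inherits them from the previous, different address. The sketch below, on invented data, contrasts that with a per-address fill using `groupby`; it is an alternative to consider, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: each address has one row with known coordinates
toy = pd.DataFrame({
    "Address": ["B St 2", "A St 1", "B St 2", "A St 1"],
    "Lat": [np.nan, 32.08, 32.10, np.nan],
    "Lon": [np.nan, 34.78, 34.80, np.nan],
})
coord_cols = ["Lat", "Lon"]

# Sort-then-ffill: row 0 ("B St 2") follows the "A St 1" rows after sorting,
# so it picks up the coordinates of the other address
naive = toy.sort_values("Address", kind="stable").ffill()

# Per-address fill: values never cross an address boundary
per_address = toy.copy()
per_address[coord_cols] = (
    toy.groupby("Address")[coord_cols].transform(lambda s: s.ffill().bfill())
)

print(naive.loc[0, "Lat"])        # 32.08 (taken from "A St 1")
print(per_address.loc[0, "Lat"])  # 32.1
```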

### 2.3. Filling na weather data with previous data

In [60]:
weatherCols = ['Time','Temprature','Max Temprature','Min Temprature','Relative Humidity','Amount of Rain','Wind Speed','Wind Direction','Max Wind Speed per Min','Max Win Speed per 10 Min']
df[weatherCols] = df[weatherCols].sort_values(['Time']).ffill()
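
A short sketch of why the assignment above keeps the original row order: the frame is sorted by `Time` only for the fill, and assigning the result back aligns on index labels. The timestamps and temperatures are invented; the column name follows the spelling used in the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical readings, deliberately out of chronological order
toy = pd.DataFrame(
    {
        "Time": ["2019-01-01 10:00", "2019-01-01 08:00", "2019-01-01 09:00"],
        "Temprature": [np.nan, 14.0, 15.5],
    },
    index=[10, 11, 12],
)

# Chronological order is 11, 12, 10, so row 10 takes the 09:00 reading
filled = toy.sort_values("Time").ffill()

# Assignment aligns on the index, so the original row order is preserved
toy["Temprature"] = filled["Temprature"]

print(toy.index.tolist())          # [10, 11, 12]
print(toy["Temprature"].tolist())  # [15.5, 14.0, 15.5]
```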

### 2.3. Filling na Area and Population
Missing values are filled from other rows of the same neighborhood.

In [64]:
neighborhoodMetaCols = ["neighborhood","Area","Population"]
df[neighborhoodMetaCols] = df[neighborhoodMetaCols].sort_values(['neighborhood']).ffill()
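
Since area and population are presumably constant within a neighborhood, another possible way to fill them is to broadcast the first known value of each group. This is a sketch on invented data, not the notebook's method; the neighborhood names and figures are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical neighborhoods with partially missing metadata
toy = pd.DataFrame({
    "neighborhood": ["North", "South", "North", "South"],
    "Area": [np.nan, 2.5, 4.0, np.nan],
    "Population": [np.nan, 8000.0, 12000.0, np.nan],
})
meta_cols = ["Area", "Population"]

# First non-null value per neighborhood, repeated for every row of that group
known = toy.groupby("neighborhood")[meta_cols].transform("first")

# Fill only the gaps; existing values are left as they are
toy[meta_cols] = toy[meta_cols].fillna(known)

print(toy["Area"].tolist())  # [4.0, 2.5, 4.0, 2.5]
```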

### 2.3. Delete unneeded columns

In [65]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 257595 entries, 0 to 300001
Data columns (total 27 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   Category                  257595 non-null  object 
 1   Avg distance              257595 non-null  int64  
 2   Billing minutes           257595 non-null  float64
 3   kmh                       257595 non-null  float64
 4   Coords                    257595 non-null  object 
 5   Lat                       257595 non-null  float64
 6   Lon                       257595 non-null  float64
 7   neighborhood              257595 non-null  object 
 8   Area                      257595 non-null  float64
 9   Population                257595 non-null  float64
 10  Street_c                  257595 non-null  object 
 11  Address_c2                257595 non-null  object 
 12  Date                      257595 non-null  object 
 13  Year                      257595 non-null  o

In [62]:
df.drop(["Car id","Order number","Address","Time","Street","City","Country","Address_c"],axis=1,inplace=True)
# Coords and Date are kept for now and dropped after visualization
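
Notebook cells are often re-run, and a second `drop` of the same columns raises a `KeyError`. A hedged sketch of one way to make such a cell re-runnable, using `errors="ignore"` on a made-up frame:

```python
import pandas as pd

# Hypothetical frame where "Order number" has already been removed
toy = pd.DataFrame({"Car id": [1], "Category": ["Regular"], "kmh": [20.0]})

to_drop = ["Car id", "Order number"]

# Labels that are not present are skipped instead of raising KeyError
trimmed = toy.drop(columns=to_drop, errors="ignore")

print(trimmed.columns.tolist())  # ['Category', 'kmh']
```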

<a id="step2.4"></a>
## 2.4 Convert Formats

### 2.4. Date Formats

In [18]:
# df["Time"] = pd.to_datetime(df["Time"])
df["Date"] = pd.to_datetime(df["Date"])
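
A minimal sketch of the conversion on invented values. The `%Y-%m-%d` format is an assumption, since the notebook does not show how `Date` is stored; `errors="coerce"` turns unparseable entries into `NaT` instead of raising.

```python
import pandas as pd

# Hypothetical date strings, one of them malformed
dates = pd.Series(["2019-07-08", "2019-12-31", "not a date"])

# Explicit format (assumed) plus coercion of bad values to NaT
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

print(parsed.isna().tolist())  # [False, False, True]
print(parsed[0].year, parsed[0].month, parsed[0].day)  # 2019 7 8
```

Once the column is a datetime, accessors such as `.dt.year` or `.dt.dayofweek` become available for later feature extraction.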