# Table of Contents
1. [Step 1: Gather the Data](#step1)
1. [Step 2: Prepare Data for Consumption](#step2)
   - [2.1 Import Libraries](#step2.1)
   - [2.2 Meet The Data](#step2.2)
   - [2.3 Clean Data](#step2.3)
   - [2.4 Convert Formats](#step2.4)

<a id="step1"></a>
# Step 1: Gather the Data
The data was collected and assembled in the create_ds.ipynb notebook.

<a id="step2"></a>
# Step 2: Prepare Data for Consumption

<a id="step2.1"></a>
## 2.1 Import Libraries

In [1]:
import sys
print("Python version: {}".format(sys.version))

import pandas as pd
print("pandas version: {}".format(pd.__version__))

import numpy as np
print("NumPy version: {}".format(np.__version__))

import scipy as sp
print("SciPy version: {}".format(sp.__version__))

import matplotlib
import matplotlib.pyplot as plt
print("matplotlib version: {}".format(matplotlib.__version__))

import seaborn as sns
print("seaborn version: {}".format(sns.__version__))
%matplotlib inline

pd.set_option('display.max_columns', None)
sns.set(style='white', context='notebook', palette='deep')

Python version: 3.7.4 (tags/v3.7.4:e09359112e, Jul  8 2019, 19:29:22) [MSC v.1916 32 bit (Intel)]
pandas version: 1.0.3
NumPy version: 1.18.2
SciPy version: 1.4.1
matplotlib version: 3.2.1
seaborn version: 0.10.0


<a id="step2.2"></a>
## 2.2 Meet The Data

Data explanation should be here

In [4]:
df = pd.read_csv(
    "./input/autotel_with_target.csv",
    dtype={
        "Order number": "str",
        "Year": "str",
        "Month": "str",
        "Day": "str",
        "Hour": "str",
        "Minute": "str",
    },
    index_col=0,
)

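The `dtype` mapping in the `read_csv` call keeps identifier-like columns as strings, which preserves details such as leading zeros. A minimal sketch of the effect, using a made-up two-row CSV (the sample values are hypothetical and not taken from the real file):

```python
import io
import pandas as pd

# Hypothetical sample; the real data lives in ./input/autotel_with_target.csv
csv_text = ",Order number,Month,Avg distance\n0,00123,03,1500\n1,00456,11,900\n"

sample = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"Order number": "str", "Month": "str"},
    index_col=0,
)

# Columns listed in dtype stay as text, so leading zeros survive
order_numbers = sample["Order number"].tolist()
months = sample["Month"].tolist()
print(order_numbers, months)
```

Columns not listed in `dtype`, such as `Avg distance` here, are still inferred by pandas.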


In [5]:
# print(df.isnull().sum())
# df.describe(include='all')

<a id="step2.3"></a>
## 2.3 Clean Data

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 136155 entries, 0 to 136154
Data columns (total 36 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   Car id                    136155 non-null  object 
 1   Order number              136155 non-null  object 
 2   Category                  136155 non-null  object 
 3   Avg distance              136155 non-null  int64  
 4   Billing minutes           132902 non-null  object 
 5   Address                   136155 non-null  object 
 6   Time                      134980 non-null  object 
 7   kmh                       132860 non-null  float64
 8   Coords                    134957 non-null  object 
 9   Lat                       134957 non-null  float64
 10  Lon                       134957 non-null  float64
 11  neighborhood              134957 non-null  object 
 12  Area                      134957 non-null  float64
 13  Population                134957 non-null  f

### 2.3. Filling empty billing minutes with 0 (most of these are canceled drives)

In [7]:
df['Billing minutes'] = pd.to_numeric(df['Billing minutes'],errors='coerce')
df['Billing minutes'] = df['Billing minutes'].fillna(0)
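The two steps above can be sketched on a tiny series. The values below are hypothetical and only illustrate how `errors='coerce'` turns unparseable entries into `NaN` before `fillna(0)` replaces them:

```python
import pandas as pd

# Hypothetical values: a numeric string, a non-numeric string, and a missing value
billing = pd.Series(["12", "n/a", None])

# errors="coerce" converts anything unparseable to NaN instead of raising
numeric = pd.to_numeric(billing, errors="coerce")

# Missing or unparseable entries become 0
filled = numeric.fillna(0)
print(filled.tolist())
```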

### 2.3. Delete empty addresses

In [8]:
delete_empty_addresses = df["Address"].isnull()

In [9]:
df = df[~delete_empty_addresses]
# These alternatives also work:
# df.drop(df[delete_empty_addresses].index, inplace=True)
# df = df[pd.notnull(df['Address'])]

In [10]:
df["Address"].isnull().sum()

0
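Another way to express the same filter is `DataFrame.dropna` with a `subset`. A minimal sketch, assuming a toy frame with made-up addresses:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the middle one lacks an address
rides = pd.DataFrame({
    "Address": ["Herzl 1", np.nan, "Dizengoff 50"],
    "kmh": [20.0, 15.0, 30.0],
})

# Drop rows where Address is missing; other columns are not checked
cleaned = rides.dropna(subset=["Address"])
remaining_nulls = int(cleaned["Address"].isnull().sum())
print(len(cleaned), remaining_nulls)
```

The original index labels are kept, which is why the cleaned frame in this notebook no longer has a contiguous index.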

### 2.3. Fill missing Street_c values from Street
The missing values occur only for streets whose names consist solely of numbers.

In [11]:
df['Street_c'] = df['Street_c'].fillna(df['Street'])
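Passing a Series to `fillna` fills each gap with the value at the same index label. A small sketch with hypothetical street names shows the row-by-row behaviour:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the numeric-only street has no cleaned name
streets = pd.DataFrame({
    "Street": ["Herzl", "1185", "Allenby"],
    "Street_c": ["Herzl", np.nan, "Allenby"],
})

# Gaps in Street_c take the value of Street from the same row
streets["Street_c"] = streets["Street_c"].fillna(streets["Street"])
print(streets["Street_c"].tolist())
```

Assigning the result back to the column avoids relying on `inplace=True` on a column selection, which may act on a temporary copy.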

### 2.3. Delete canceled and NO SHOW category rows

In [12]:
df = df[(df['Category'] != 'Canceled') & (df['Category'] != 'NO SHOW')]
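When several categories are excluded, `Series.isin` can read more easily than chained comparisons. A sketch on toy data, where the `Regular` category is a hypothetical placeholder for any category that should be kept:

```python
import pandas as pd

# "Regular" is a made-up category name used only for illustration
rides = pd.DataFrame({"Category": ["Regular", "Canceled", "NO SHOW", "Regular"]})

excluded = ["Canceled", "NO SHOW"]

# Keep rows whose category is not in the excluded list
kept = rides[~rides["Category"].isin(excluded)]
print(kept["Category"].tolist())
```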

### 2.3. Delete empty kmh drives
These rows have the same start and destination address, so the drive did not actually take place.

In [13]:
df = df[(df['Avg distance'] != 0) & (df['Billing minutes'] != 0)]

### 2.3. Filling missing kmh values
Formula: kmh = Avg distance / (Billing minutes / 60)

In [14]:
# df['kmh'] = df.apply(lambda row: row['Avg distance'] / (row['Billing minutes'] / 60) if np.isnan(row['kmh']) else row['kmh'], axis=1)

In [15]:
# Divide distance by hours, matching the formula above
df['kmh'] = df['kmh'].fillna(df['Avg distance'] / (df['Billing minutes'] / 60))
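As a worked example of the stated formula, a drive of 10 distance units billed for 30 minutes gives 10 / (30 / 60) = 20. The sketch below assumes `Avg distance` is in kilometres (the source does not state the unit) and uses made-up rows:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the first lacks kmh, the second already has it
rides = pd.DataFrame({
    "Avg distance": [10, 6],
    "Billing minutes": [30.0, 20.0],
    "kmh": [np.nan, 25.0],
})

# Speed = distance / time in hours
computed = rides["Avg distance"] / (rides["Billing minutes"] / 60)

# Only missing kmh values are replaced; existing ones are untouched
rides["kmh"] = rides["kmh"].fillna(computed)
print(rides["kmh"].tolist())
```

Rows with zero billing minutes would divide by zero, which is one reason such rows are removed before this step.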

### 2.3. Filling empty Lat, Lon and Coords from rows with similar addresses

In [16]:
AddressCols = ['Address','Lat','Lon','neighborhood','Coords']
df[AddressCols] = df[AddressCols].sort_values(['Address']).ffill()
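Sorting by address and forward-filling can carry a value across an address boundary when the first row of an address has no value of its own. A possible alternative, shown here as a sketch on hypothetical data rather than as the notebook's method, is to fill within each address group:

```python
import numpy as np
import pandas as pd

# Hypothetical addresses and latitudes
rides = pd.DataFrame({
    "Address": ["A St 1", "B St 2", "A St 1", "B St 2"],
    "Lat": [32.05, np.nan, np.nan, 32.10],
})

# Sort-then-ffill: the first "B St 2" row has no earlier B value,
# so it inherits the last "A St 1" value
leaky = rides.sort_values("Address", kind="mergesort")["Lat"].ffill()

# Per-group fill: values never cross from one address to another
grouped = rides.groupby("Address")["Lat"].transform(lambda s: s.ffill().bfill())
print(leaky.loc[1], grouped.tolist())
```

Whether the cross-address fill matters depends on how often the first sorted row of an address is incomplete in the real data.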

### 2.3. Filling missing weather data with the previous observation

In [17]:
weatherCols = [
    'Time', 'Temprature', 'Max Temprature', 'Min Temprature',
    'Relative Humidity', 'Amount of Rain', 'Wind Speed', 'Wind Direction',
    'Max Wind Speed per Min', 'Max Win Speed per 10 Min',
]
df[weatherCols] = df[weatherCols].sort_values(['Time']).ffill()
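The sort-then-fill pattern works because the assignment aligns on index labels, so the frame keeps its original row order while each gap receives the most recent earlier reading. A minimal sketch with hypothetical timestamps and temperatures:

```python
import numpy as np
import pandas as pd

# Hypothetical readings, deliberately out of chronological order
weather = pd.DataFrame({
    "Time": ["2020-01-01 10:00", "2020-01-01 08:00", "2020-01-01 09:00"],
    "Temprature": [np.nan, 14.0, np.nan],
})

cols = ["Time", "Temprature"]

# Sort chronologically, forward-fill, then assign back by index label
weather[cols] = weather[cols].sort_values("Time").ffill()
print(weather["Temprature"].tolist())
```

Sorting the `Time` strings is chronological here only because they share one zero-padded, year-first format; that is an assumption about the real column.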

### 2.3. Filling missing Area and Population
Missing values are filled from rows of the same neighborhood.

In [18]:
neighborhoodMetaCols = ["neighborhood","Area","Population"]
df[neighborhoodMetaCols] = df[neighborhoodMetaCols].sort_values(['neighborhood']).ffill()
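Since area and population are presumably constant within a neighborhood, one alternative is to broadcast the first known value of each group. This is a sketch with made-up neighborhood names and areas, not the approach used in the cell above:

```python
import numpy as np
import pandas as pd

# Hypothetical neighborhoods and areas
meta = pd.DataFrame({
    "neighborhood": ["North", "North", "South", "South"],
    "Area": [np.nan, 2.5, 4.0, np.nan],
})

# "first" returns the first non-null value in each group, broadcast to every row
first_known = meta.groupby("neighborhood")["Area"].transform("first")

# Only the gaps are replaced
meta["Area"] = meta["Area"].fillna(first_known)
print(meta["Area"].tolist())
```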

### 2.3. Delete unneeded columns

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 128518 entries, 0 to 136153
Data columns (total 36 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   Car id                    128518 non-null  object 
 1   Order number              128518 non-null  object 
 2   Category                  128518 non-null  object 
 3   Avg distance              128518 non-null  int64  
 4   Billing minutes           128518 non-null  float64
 5   Address                   128518 non-null  object 
 6   Time                      128518 non-null  object 
 7   kmh                       128518 non-null  float64
 8   Coords                    128518 non-null  object 
 9   Lat                       128518 non-null  float64
 10  Lon                       128518 non-null  float64
 11  neighborhood              128518 non-null  object 
 12  Area                      128518 non-null  float64
 13  Population                128518 non-null  f

In [20]:
df.drop(columns=["Car id", "Order number", "Address", "Time", "Street", "City", "Country", "Address_c"], inplace=True)
# After visualization, also delete Coords and Date

<a id="step2.4"></a>
## 2.4 Convert Formats

### 2.4. Date Formats

In [21]:
# df["Time"] = pd.to_datetime(df["Time"])
df["Date"] = pd.to_datetime(df["Date"])
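Once a column has been parsed with `pd.to_datetime`, the `.dt` accessor exposes its date components. A brief sketch, assuming ISO-formatted date strings (the real `Date` format is not shown in this notebook):

```python
import pandas as pd

# Hypothetical dates; both fall on a Sunday
dates = pd.DataFrame({"Date": ["2020-03-01", "2020-03-15"]})

# Convert text to datetime64 values
dates["Date"] = pd.to_datetime(dates["Date"])

# dayofweek counts from Monday=0 to Sunday=6
weekdays = dates["Date"].dt.dayofweek.tolist()
is_datetime = pd.api.types.is_datetime64_any_dtype(dates["Date"])
print(weekdays, is_datetime)
```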