# Initial Exploratory Taxi Trip DataSet

The objective of this notebook is to carry out a final analysis of the Chicago taxi trips together with the Chicago weather dataset, and to remove any trips with erroneous data that the previous filters did not catch.

## Chicago Coordinates 
Longitude, latitude: -87.6244212, 41.8755616

## 1 Create our environment

In this section we create the Spark session.

In [1]:
import findspark
findspark.init()

In [2]:
from pyspark.sql import SparkSession, SQLContext
from pyspark import SparkConf, SparkContext

In [3]:
sc_conf = SparkConf()

In [4]:
sc_conf.set('spark.driver.port', '62678')
sc_conf.set('spark.rdd.compress', 'True')
sc_conf.set('spark.driver.host', '127.0.0.1')
sc_conf.set('spark.serializer.objectStreamReset', '100')
sc_conf.set('spark.master', 'local[*]')
sc_conf.set('spark.executor.id', 'driver')
sc_conf.set('spark.submit.deployMode', 'client')
sc_conf.set('spark.ui.showConsoleProgress', 'true')
sc_conf.set('spark.app.name', 'pyspark-shell')
sc_conf.set("spark.executor.memory","6g")
sc_conf.set("spark.driver.memory","6g")

<pyspark.conf.SparkConf at 0x1122c1c50>

In [5]:
sc_conf.getAll()

dict_items([('spark.driver.port', '62678'), ('spark.rdd.compress', 'True'), ('spark.driver.host', '127.0.0.1'), ('spark.serializer.objectStreamReset', '100'), ('spark.master', 'local[*]'), ('spark.executor.id', 'driver'), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true'), ('spark.app.name', 'pyspark-shell'), ('spark.executor.memory', '6g'), ('spark.driver.memory', '6g')])

In [6]:
sc = SparkContext(conf=sc_conf)

In [7]:
sql = SQLContext(sc)

In [8]:
session = sql.sparkSession
session

In [9]:
session.sparkContext.getConf().getAll()

[('spark.executor.memory', '6g'),
 ('spark.driver.port', '62678'),
 ('spark.driver.host', '127.0.0.1'),
 ('spark.driver.memory', '6g'),
 ('spark.rdd.compress', 'True'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.master', 'local[*]'),
 ('spark.executor.id', 'driver'),
 ('spark.submit.deployMode', 'client'),
 ('spark.ui.showConsoleProgress', 'true'),
 ('spark.app.name', 'pyspark-shell'),
 ('spark.app.id', 'local-1557775041619')]

### 1.1 Load the libraries

In [10]:
# We load the libraries we are going to use in our analysis
%matplotlib inline

import shutil
import os 
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.ml.feature import StringIndexer
from pyspark.mllib.stat import Statistics
from pyspark.mllib.linalg import Vectors
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.options.display.max_columns = None

## 2 Read the data

In this section we load the Chicago Taxi Trips and Chicago Weather datasets.

### 2.1 Taxi trips

In [11]:
# Before loading the dataset, we define its schema (field names and types)
taxi_schema = StructType([
    StructField("trip_id", StringType(), nullable = True),
    StructField("taxi_id", StringType(), nullable = True),
    StructField("trip_start_timestamp", TimestampType(), nullable = True),
    StructField("trip_end_timestamp", TimestampType(), nullable = True),
    StructField("trip_seconds", IntegerType(), nullable = True),
    StructField("trip_miles", DoubleType(), nullable = True),
    StructField("pickup_community_area", IntegerType(), nullable = True),
    StructField("dropoff_community_area", IntegerType(), nullable = True),
    StructField("fare", DoubleType(), nullable = True),
    StructField("tips", DoubleType(), nullable = True),
    StructField("tolls", DoubleType(), nullable = True),
    StructField("extras", DoubleType(), nullable = True),
    StructField("trip_total", DoubleType(), nullable = True),
    StructField("payment_type", StringType(), nullable = True),
    StructField("company", StringType(), nullable = True),
    StructField("pickup_centroid_latitude", DoubleType(), nullable = True),
    StructField("pickup_centroid_longitude", DoubleType(), nullable = True),
    StructField("dropoff_centroid_latitude", DoubleType(), nullable = True),
    StructField("dropoff_centroid_longitude", DoubleType(), nullable = True)])

In [12]:
# We load the data of Chicago Taxi Trips
taxi_trips = session.read.csv('../Data/taxi_chicago_filter.csv',
                              header=True,
                              schema=taxi_schema)

In [13]:
# We write our data in parquet and load it again
taxi_trips.write.parquet("../Data/temp/taxi_trips.parquet")
taxi_trips = session.read.parquet("../Data/temp/taxi_trips.parquet")

#### 2.1.1 Check that the data loaded correctly

In [14]:
taxi_trips.show(5)

+--------------------+--------------------+--------------------+-------------------+------------+----------+---------------------+----------------------+-----+----+-----+------+----------+------------+--------------------+------------------------+-------------------------+-------------------------+--------------------------+
|             trip_id|             taxi_id|trip_start_timestamp| trip_end_timestamp|trip_seconds|trip_miles|pickup_community_area|dropoff_community_area| fare|tips|tolls|extras|trip_total|payment_type|             company|pickup_centroid_latitude|pickup_centroid_longitude|dropoff_centroid_latitude|dropoff_centroid_longitude|
+--------------------+--------------------+--------------------+-------------------+------------+----------+---------------------+----------------------+-----+----+-----+------+----------+------------+--------------------+------------------------+-------------------------+-------------------------+--------------------------+
|fd7ca20775773ffc5.

In [None]:
# We check the column types
taxi_trips.printSchema()

root
 |-- trip_id: string (nullable = true)
 |-- taxi_id: string (nullable = true)
 |-- trip_start_timestamp: timestamp (nullable = true)
 |-- trip_end_timestamp: timestamp (nullable = true)
 |-- trip_seconds: integer (nullable = true)
 |-- trip_miles: double (nullable = true)
 |-- pickup_community_area: integer (nullable = true)
 |-- dropoff_community_area: integer (nullable = true)
 |-- fare: double (nullable = true)
 |-- tips: double (nullable = true)
 |-- tolls: double (nullable = true)
 |-- extras: double (nullable = true)
 |-- trip_total: double (nullable = true)
 |-- payment_type: string (nullable = true)
 |-- company: string (nullable = true)
 |-- pickup_centroid_latitude: double (nullable = true)
 |-- pickup_centroid_longitude: double (nullable = true)
 |-- dropoff_centroid_latitude: double (nullable = true)
 |-- dropoff_centroid_longitude: double (nullable = true)



#### 2.1.2 Study the dimensions of the dataset

In [None]:
taxi_trips.count()

29755922

In [None]:
len(taxi_trips.columns)

19

### 2.2 Chicago Weather

In [None]:
# Before loading the dataset, we define its schema (field names and types)
weather_schema = StructType([
    StructField("datetime", TimestampType(), nullable = True),
    StructField("humidity", DoubleType(), nullable = True),
    StructField("pressure", DoubleType(), nullable = True),
    StructField("temperature", DoubleType(), nullable = True),
    StructField("weather_description", StringType(), nullable = True),
    StructField("wind_direction", DoubleType(), nullable = True),
    StructField("wind_speed", DoubleType(), nullable = True)])

In [None]:
# We load the data of Chicago Weather
chicago_weather = session.read.csv('../Data/Chicago_weather.csv.gz',
                              header=True,
                              schema=weather_schema)

In [None]:
# We write our data in parquet and load it again
chicago_weather.write.parquet("../Data/temp/chicago_weather.parquet")
chicago_weather = session.read.parquet("../Data/temp/chicago_weather.parquet")

#### 2.2.1 Check that the data loaded correctly

In [None]:
chicago_weather.show(5)

+-------------------+--------+--------+------------------+-------------------+--------------+----------+
|           datetime|humidity|pressure|       temperature|weather_description|wind_direction|wind_speed|
+-------------------+--------+--------+------------------+-------------------+--------------+----------+
|2012-10-01 13:00:00|    71.0|  1014.0|            284.01|    overcast clouds|           0.0|       0.0|
|2012-10-01 14:00:00|    70.0|  1014.0|284.05469097400004|    overcast clouds|           0.0|       0.0|
|2012-10-01 15:00:00|    70.0|  1014.0|     284.177412183|    overcast clouds|           0.0|       0.0|
|2012-10-01 16:00:00|    70.0|  1014.0|     284.300133393|    overcast clouds|           0.0|       0.0|
|2012-10-01 17:00:00|    69.0|  1014.0|284.42285460200003|    overcast clouds|           0.0|       0.0|
+-------------------+--------+--------+------------------+-------------------+--------------+----------+
only showing top 5 rows



In [None]:
# We check the column types
chicago_weather.printSchema()

root
 |-- datetime: timestamp (nullable = true)
 |-- humidity: double (nullable = true)
 |-- pressure: double (nullable = true)
 |-- temperature: double (nullable = true)
 |-- weather_description: string (nullable = true)
 |-- wind_direction: double (nullable = true)
 |-- wind_speed: double (nullable = true)



#### 2.2.2 Study the dimensions of the dataset

In [None]:
chicago_weather.count()

45252

In [None]:
len(chicago_weather.columns)

7

## 3 Join both datasets

In this section we join the Chicago Taxi Trips and Chicago Weather datasets.

### 3.1 Define the keys to join both datasets

#### 3.1.1 Taxi Trips

In [None]:
# We create a computed field from column 'trip_start_timestamp', cast to string, so that
# we can later join with the weather dataset on year, month, day and hour
taxi_trips = taxi_trips.withColumn("AUX_trip_start_timestamp",
                                   F.col("trip_start_timestamp").cast(T.StringType()))

#### 3.1.2 Weather

In [None]:
# We create a computed field from column 'datetime', cast to string, so that
# we can later join with the taxi_trips dataset on year, month, day and hour
chicago_weather = chicago_weather.withColumn("AUX_datetime",
                                             F.col("datetime").cast(T.StringType()))

### 3.2 Join both Datasets

In [None]:
# We join both datasets on year, month, day and hour (the first 13 characters, 'yyyy-MM-dd HH')
taxi_trips = taxi_trips.join(chicago_weather,
                             on=F.col('AUX_trip_start_timestamp')[0:13]==F.col('AUX_datetime')[0:13],
                             how='left_outer')
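To make the join key concrete, here is a minimal plain-Python sketch (not Spark code) of what the 13-character prefix represents. The helper name `hour_key` is hypothetical and only used for illustration; the sample timestamps are taken from the trip and weather rows shown in this notebook.

```python
from datetime import datetime


def hour_key(ts: datetime) -> str:
    """Return the 'yyyy-MM-dd HH' prefix of a timestamp (hypothetical helper)."""
    # A timestamp rendered as 'yyyy-MM-dd HH:mm:ss' has the date in its first
    # 10 characters, a space, and the two-digit hour: 13 characters in total.
    return ts.strftime("%Y-%m-%d %H:%M:%S")[:13]


trip_start = datetime(2014, 12, 13, 14, 30)   # trip started at 14:30
weather_obs = datetime(2014, 12, 13, 14, 0)   # hourly weather observation

print(hour_key(trip_start))                           # 2014-12-13 14
print(hour_key(trip_start) == hour_key(weather_obs))  # True
```

Every trip that starts within a given hour therefore shares its key with the weather observation recorded for that hour.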

#### 3.2.1 Check that the data joined correctly

In [None]:
taxi_trips.count()

29760901

After the join the row count rises from 29,755,922 to 29,760,901: the join generated 4,979 duplicate rows that we must eliminate.
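A left outer join keeps every row of the left table, so the row count can only grow when a trip matches more than one weather row, that is, when the weather data holds several rows for the same hour. The plain-Python sketch below illustrates this on toy data; `left_join` and the sample rows are hypothetical and not part of the notebook.

```python
from collections import defaultdict


def left_join(trips, weather):
    """Left outer join of (trip_id, hour_key) rows with (hour_key, temperature) rows."""
    by_hour = defaultdict(list)
    for key, temperature in weather:
        by_hour[key].append(temperature)

    joined = []
    for trip_id, key in trips:
        # Trips without a matching hour are kept once, with a null temperature.
        for temperature in by_hour.get(key, [None]):
            joined.append((trip_id, key, temperature))
    return joined


trips = [("a", "2015-12-11 08"), ("b", "2015-12-11 09"), ("c", "2015-12-11 10")]
# The hour 08 appears twice in the weather rows, so trip "a" is duplicated.
weather = [("2015-12-11 08", 280.9), ("2015-12-11 08", 281.0), ("2015-12-11 09", 281.5)]

joined = left_join(trips, weather)
print(len(trips), len(joined))  # 3 4

# Keeping one row per trip_id restores the original row count.
deduplicated = {row[0]: row for row in joined}
print(len(deduplicated))  # 3
```

Note that when duplicates are dropped on the trip columns only, which of the competing weather rows survives for a given trip is not determined by the trip data itself.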

In [None]:
len(taxi_trips.columns)

28

In [None]:
taxi_trips.limit(5).toPandas()

Unnamed: 0,trip_id,taxi_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_community_area,dropoff_community_area,fare,tips,tolls,extras,trip_total,payment_type,company,pickup_centroid_latitude,pickup_centroid_longitude,dropoff_centroid_latitude,dropoff_centroid_longitude,AUX_trip_start_timestamp,datetime,humidity,pressure,temperature,weather_description,wind_direction,wind_speed,AUX_datetime
0,fd7ca20775773ffc5426f012bb13a8e39b21d5ba,b597697c5b962a3f36ed67d274ec82ed1b72232c537b5a...,2015-12-11 08:00:00,2015-12-11 08:15:00,300,0.9,8,28,5.65,0.0,0.0,0.0,5.65,Cash,KOAM Taxi Association,41.893216,-87.637844,41.8853,-87.642808,2015-12-11 08:00:00,2015-12-11 08:00:00,61.0,1005.0,280.94,sky is clear,220.0,2.0,2015-12-11 08:00:00
1,fd7ca2611fd6add8ab90cfa53d24867f72a3c63c,b6881e07d2abb8b6adb5a0962079be3ed8f197afc8127e...,2014-12-13 14:30:00,2014-12-13 15:00:00,1860,17.2,8,76,35.65,7.5,0.0,2.0,45.15,Credit Card,Taxi Affiliation Services,41.892508,-87.626215,41.979071,-87.90304,2014-12-13 14:30:00,2014-12-13 14:00:00,100.0,1037.0,276.479667,scattered clouds,239.0,4.0,2014-12-13 14:00:00
2,fd7ca28deeef5b1e011d953f132d1363c73f6cb7,159b1d5b970137f6b7afee927fd34913c2dfafe07f4f63...,2015-02-27 17:15:00,2015-02-27 18:00:00,2820,10.1,33,77,26.85,5.35,0.0,0.0,32.2,Credit Card,Taxi Affiliation Services,41.849247,-87.624135,41.979912,-87.664188,2015-02-27 17:15:00,2015-02-27 17:00:00,96.0,1051.0,259.549,sky is clear,287.0,3.0,2015-02-27 17:00:00
3,fd7ca2b529d9cddf08e914e9ab9149d192618507,54fda80b28cfa29ae2fcd9bfee10b6ab21f4530e1d883a...,2014-07-10 16:45:00,2014-07-10 17:15:00,1620,4.5,32,8,15.65,0.0,0.0,0.0,15.65,Cash,Taxi Affiliation Services,41.880994,-87.632746,41.892073,-87.628874,2014-07-10 16:45:00,2014-07-10 16:00:00,60.0,1022.0,296.6,broken clouds,0.0,3.0,2014-07-10 16:00:00
4,fd7ca2e384bc137591760eb9fe5804b563e3e26c,f3e9e9bb02a72ad6b998d51c08c17c6791a929a1a8d61d...,2015-04-21 19:45:00,2015-04-21 20:15:00,1680,10.0,76,15,22.25,7.25,0.0,2.0,31.5,Credit Card,Northwest Management LLC,41.980264,-87.913625,41.954028,-87.763399,2015-04-21 19:45:00,2015-04-21 19:00:00,69.0,1016.0,280.485667,broken clouds,265.0,8.0,2015-04-21 19:00:00


#### 3.2.2 Drop duplicate trips after the join

In [None]:
# We drop the duplicates generated by the join, comparing rows on the original trip columns only
taxi_trips = taxi_trips.dropDuplicates(['trip_id',
                                        'taxi_id',
                                        'trip_start_timestamp',
                                        'trip_end_timestamp',
                                        'trip_seconds',
                                        'trip_miles',
                                        'pickup_community_area',
                                        'dropoff_community_area',
                                        'fare',
                                        'tips',
                                        'tolls',
                                        'extras',
                                        'trip_total',
                                        'payment_type',
                                        'company',
                                        'pickup_centroid_latitude',
                                        'pickup_centroid_longitude',
                                        'dropoff_centroid_latitude',
                                        'dropoff_centroid_longitude'])

In [None]:
# We verify that the duplicate trips have been removed (the count matches the original dataset)
taxi_trips.count()

29755922

#### 3.2.3 Remove redundant columns

In [None]:
#  We check the columns of the dataset
taxi_trips.columns

['trip_id',
 'taxi_id',
 'trip_start_timestamp',
 'trip_end_timestamp',
 'trip_seconds',
 'trip_miles',
 'pickup_community_area',
 'dropoff_community_area',
 'fare',
 'tips',
 'tolls',
 'extras',
 'trip_total',
 'payment_type',
 'company',
 'pickup_centroid_latitude',
 'pickup_centroid_longitude',
 'dropoff_centroid_latitude',
 'dropoff_centroid_longitude',
 'AUX_trip_start_timestamp',
 'datetime',
 'humidity',
 'pressure',
 'temperature',
 'weather_description',
 'wind_direction',
 'wind_speed',
 'AUX_datetime']

In [None]:
# We drop the weather timestamp columns, which are no longer needed after the join
taxi_trips = taxi_trips.drop('datetime',
                             'AUX_datetime')

#### 3.2.4 Check the dataset again

In [None]:
len(taxi_trips.columns)

26

In [None]:
taxi_trips.limit(5).toPandas()

Unnamed: 0,trip_id,taxi_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_community_area,dropoff_community_area,fare,tips,tolls,extras,trip_total,payment_type,company,pickup_centroid_latitude,pickup_centroid_longitude,dropoff_centroid_latitude,dropoff_centroid_longitude,AUX_trip_start_timestamp,humidity,pressure,temperature,weather_description,wind_direction,wind_speed
0,25aef066b51c7b4c370ce6d1fdd5384f67047c18,d092ee0cb44dc6fa5f6757a7ef07181c4ee3d25eed0bc4...,2014-03-16 01:30:00,2014-03-16 02:00:00,1800,16.1,32,54,58.25,0.0,0.0,4.5,62.75,Cash,Taxi Affiliation Services,41.878866,-87.625192,41.660136,-87.602848,2014-03-16 01:30:00,80.0,1023.0,272.19,overcast clouds,60.0,7.0
1,8cbdd8b93c1730da858467e69470f81d103fa66d,0b12c0f0c81641a43904bbce23657b88b23abf83aaaed6...,2014-11-22 09:45:00,2014-11-22 10:00:00,1260,17.8,28,54,35.85,0.0,0.0,0.0,35.85,Cash,Taxi Affiliation Services,41.874005,-87.663518,41.660136,-87.602848,2014-11-22 09:45:00,100.0,1034.0,275.616,light rain,208.0,9.0
2,b51cd2c700d637d9c7e834bb738cf8a1409fedd1,0b12c0f0c81641a43904bbce23657b88b23abf83aaaed6...,2016-03-06 09:30:00,2016-03-06 09:45:00,1140,18.1,28,54,44.25,0.0,0.0,0.0,44.25,Cash,Taxi Affiliation Services,41.874005,-87.663518,41.660136,-87.602848,2016-03-06 09:30:00,80.0,1025.0,274.15,overcast clouds,175.0,4.0
3,39846cf69124e7c3c6fb3add96e2fd6f180fc8b2,c6147dad19c7f61319ff0644f1b59ef2a077c3274fbb94...,2013-06-07 08:00:00,2013-06-07 08:30:00,1380,18.6,28,55,37.45,0.0,0.0,0.0,37.45,Cash,Dispatch Taxi Affiliation,41.874005,-87.663518,41.663671,-87.540936,2013-06-07 08:00:00,82.0,1012.0,285.95,light rain,53.0,2.0
4,56935737fc8af8ad8b71e0a118586d462bbb5608,fb318ecb67178004b27e8a37d5214da86679b46764daf8...,2016-03-18 13:45:00,2016-03-18 14:15:00,1800,20.9,32,55,51.5,0.0,0.0,0.0,51.5,Cash,Taxi Affiliation Services,41.878866,-87.625192,41.663671,-87.540936,2016-03-18 13:45:00,55.0,1018.0,275.2,sky is clear,280.0,7.0


In [None]:
taxi_trips.printSchema()

root
 |-- trip_id: string (nullable = true)
 |-- taxi_id: string (nullable = true)
 |-- trip_start_timestamp: timestamp (nullable = true)
 |-- trip_end_timestamp: timestamp (nullable = true)
 |-- trip_seconds: integer (nullable = true)
 |-- trip_miles: double (nullable = true)
 |-- pickup_community_area: integer (nullable = true)
 |-- dropoff_community_area: integer (nullable = true)
 |-- fare: double (nullable = true)
 |-- tips: double (nullable = true)
 |-- tolls: double (nullable = true)
 |-- extras: double (nullable = true)
 |-- trip_total: double (nullable = true)
 |-- payment_type: string (nullable = true)
 |-- company: string (nullable = true)
 |-- pickup_centroid_latitude: double (nullable = true)
 |-- pickup_centroid_longitude: double (nullable = true)
 |-- dropoff_centroid_latitude: double (nullable = true)
 |-- dropoff_centroid_longitude: double (nullable = true)
 |-- AUX_trip_start_timestamp: string (nullable = true)
 |-- humidity: double (nullable = true)
 |-- pressure: do

In [None]:
# We save the dataset in parquet and load it again
taxi_trips.write.parquet("../Data/temp/taxi_weather_trips.parquet")
taxi_trips = session.read.parquet("../Data/temp/taxi_weather_trips.parquet")

## 4 Outliers

In this section we analyze the joined dataset and filter out the trips that have anomalous values.


In [None]:
# We create the column 'speed' in miles per hour to help identify the trips we will remove later
taxi_trips = taxi_trips.withColumn("speed",
                                   3600*(F.col("trip_miles")/F.col("trip_seconds")).cast(T.FloatType()))
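As a sanity check on the formula, the following plain-Python sketch computes the same quantity for a single trip. The function name `speed_mph` and the sample values are hypothetical. Returning `None` for a zero duration is an assumption chosen to mirror a null result, which is what Spark's default (non-ANSI) division by zero produces.

```python
from typing import Optional


def speed_mph(trip_miles: Optional[float], trip_seconds: Optional[int]) -> Optional[float]:
    """Average speed in miles per hour, or None when it cannot be computed."""
    # A missing distance, a missing duration or a zero duration gives no speed.
    if trip_miles is None or not trip_seconds:
        return None
    # miles / seconds is miles per second; multiply by 3600 seconds per hour.
    return 3600.0 * trip_miles / trip_seconds


print(speed_mph(10.0, 1800))  # 20.0  (10 miles in half an hour)
print(speed_mph(1.0, 0))      # None
```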

### 4.1 Get the descriptive statistics of some variables

In [None]:
# We select the variables we are going to analyze
analysed_variables = ['trip_seconds',
                      'trip_miles',
                      'speed',
                      'fare',
                      'tips',
                      'tolls',
                      'extras',
                      'trip_total',
                      'humidity',
                      'pressure',
                      'temperature',
                      'wind_direction',
                      'wind_speed']

In [None]:
# We get the descriptive statistics of these columns
thresholds={}
for var in analysed_variables:
    print(var)
    describe = taxi_trips.select(F.round(F.mean(var), 2).alias("mean"),
                                F.min(var).alias("min"),
                                F.max(var).alias("max"),
                                F.round(F.stddev(var), 2).alias("stddv"))
    q1, median, q3 = taxi_trips.approxQuantile(var, [0.25, 0.5, 0.75], 0)
    iqr = q3 - q1
    describe = describe.withColumn("q1", F.lit(q1))
    describe = describe.withColumn("median", F.lit(median))
    describe = describe.withColumn("q3", F.lit(q3))
    describe = describe.withColumn("iqr", F.lit(iqr))
    describe = describe.withColumn("threshold", 1.5*F.lit(iqr))
    describe.show()
    describe_list = describe.collect()
    for value in describe_list:
        element = [i for i in value]
        # Store [threshold (1.5 * IQR), q1, q3] for each variable
        thresholds[var]=[element[8],
                         element[4],
                         element[6]]

trip_seconds
+------+---+-----+------+-----+------+------+-----+---------+
|  mean|min|  max| stddv|   q1|median|    q3|  iqr|threshold|
+------+---+-----+------+-----+------+------+-----+---------+
|883.11|120|83520|757.98|420.0| 660.0|1080.0|660.0|    990.0|
+------+---+-----+------+-----+------+------+-----+---------+

trip_miles
+----+---+------+-----+---+------+---+---+---------+
|mean|min|   max|stddv| q1|median| q3|iqr|threshold|
+----+---+------+-----+---+------+---+---+---------+
|4.26|0.6|1998.1| 8.58|1.1|   1.8|4.2|3.1|     4.65|
+----+---+------+-----+---+------+---+---+---------+

speed
+-----+----------+-------+-----+---+------+----+---+---------+
| mean|       min|    max|stddv| q1|median|  q3|iqr|threshold|
+-----+----------+-------+-----+---+------+----+---+---------+
|16.33|0.02962963|57492.0|40.26|9.0|  12.0|18.0|9.0|     13.5|
+-----+----------+-------+-----+---+------+----+---+---------+

fare
+-----+----+-------+-----+----+------+-----+---+------------------+
| me

In [None]:
#We get the number of taxis in the dataset
taxi_trips.agg(F.countDistinct(F.col('taxi_id')).alias('taxis_chicago')).show()

In [None]:
#We get the number of distinct payment methods
taxi_trips.agg(F.countDistinct(F.col('payment_type')).alias('payment_methods')).show()

In [None]:
#We get the number of distinct companies
taxi_trips.agg(F.countDistinct(F.col('company')).alias('companies')).show()

In [None]:
#We get the number of distinct weather descriptions
taxi_trips.agg(F.countDistinct(F.col('weather_description')).alias('weather_descriptions')).show()

### 4.2 Get the number and percentage of outliers

In [None]:
# After the statistical analysis we obtain the number of outliers and their percentage of the total.
# thresholds[i] holds [1.5 * IQR, q1, q3], so the fences are q1 - 1.5 * IQR and q3 + 1.5 * IQR.
outliers = {}
n_rows=taxi_trips.count()
for i in thresholds:
    lower = thresholds[i][1] - thresholds[i][0]
    upper = thresholds[i][2] + thresholds[i][0]
    outlier = taxi_trips.filter((F.col(i) < lower) | (F.col(i) > upper)).count()
    outliers[i] = [outlier,outlier/n_rows*100]
df_outliers = pd.DataFrame(outliers).T
df_outliers.columns = ['outliers_number','%_outliers']
df_outliers = df_outliers.sort_values('%_outliers',ascending=False)
df_outliers
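The rule applied here flags a value as an outlier when it lies more than 1.5 times the interquartile range below the first quartile or above the third quartile. The sketch below reproduces that rule in plain Python on a small hypothetical sample of trip durations. It uses `statistics.quantiles` with the inclusive method, whose quartile definition may differ slightly from the one `approxQuantile` uses, so treat it as an illustration of the rule rather than a reproduction of the Spark results.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences q1 - k * IQR and q3 + k * IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def count_outliers(values, k=1.5):
    """Count the values that fall outside the fences."""
    low, high = iqr_fences(values, k)
    return sum(1 for v in values if v < low or v > high)


# Hypothetical trip durations in seconds; the last one is clearly anomalous.
durations = [420, 480, 600, 660, 720, 900, 1080, 1200, 83520]

print(iqr_fences(durations))      # (-120.0, 1800.0)
print(count_outliers(durations))  # 1
```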

### 4.3 Plot the outliers

In [None]:
# We plot the percentage of outliers obtained
ax=df_outliers['%_outliers'].plot(kind='bar', title ="Outliers", figsize=(15, 10), \
                                         fontsize=12, rot=45)
ax.set_xlabel("Features", fontsize=12)
ax.set_ylabel("%_outliers", fontsize=12)
plt.show()

### 4.4 Remove trips with strange values

We are not going to use the outliers obtained above. Instead, we examine the columns whose descriptive statistics suggest strange values and count how many trips each candidate filter would keep.

#### 4.4.1 Number of trips depending on the filter by duration

In [None]:
# We get the number of trips by duration
hours = [1,1.5,2,2.5,3]
dicc_trip_seconds={}
for duration in hours:
    duration = duration * 3600
    dicc_trip_seconds[duration/3600] = [taxi_trips.filter(F.col('trip_seconds') <= duration).count()]

# We convert it to pandas    
trip_duration = (pd.DataFrame(dicc_trip_seconds).T).reset_index()
trip_duration.columns = ['duration_hours','number_trips']
trip_duration
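Each limit in the loop above triggers its own full `count()` over the dataset. As an illustration of the alternative, the plain-Python sketch below counts all limits in a single pass over the values; the function name `counts_at_or_below` and the sample durations are hypothetical. In Spark the same idea is usually expressed as one `agg` call with a conditional sum per limit, for example `F.sum(F.when(condition, 1).otherwise(0))`, though this notebook does not do so.

```python
def counts_at_or_below(values, limits):
    """Count, in one pass, how many values are <= each limit."""
    counts = {limit: 0 for limit in limits}
    for value in values:
        if value is None:
            continue  # missing durations satisfy no filter
        for limit in limits:
            if value <= limit:
                counts[limit] += 1
    return counts


# Hypothetical trip durations in seconds.
trip_seconds = [300, 1860, 2820, 4000, 6000, 9500, 12000]
# The same limits as the notebook, converted from hours to seconds.
limits = [hours * 3600 for hours in (1, 1.5, 2, 2.5, 3)]

counts = counts_at_or_below(trip_seconds, limits)
for limit in limits:
    print(limit / 3600, counts[limit])
```

The counts are cumulative by construction: every trip counted for a limit is also counted for all larger limits.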

#### 4.4.2 Number of trips depending on the filter by trip length

In [None]:
# We get the number of trips by path length
miles = [10,15,20,50,100]
dicc_trip_miles={}
for length in miles:
    dicc_trip_miles[str(length)] = [taxi_trips.filter(F.col('trip_miles') <= length).count()]

# We convert it to pandas    
trip_length = (pd.DataFrame(dicc_trip_miles).T).reset_index()
trip_length.columns = ['length_miles','number_trips']
trip_length

#### 4.4.3 Number of trips depending on the filter by fare

In [None]:
# We get the number of trips by fare
fares = [100,200,500]
dicc_trip_fare={}
for fare in fares:
    dicc_trip_fare[str(fare)] = [taxi_trips.filter(F.col('fare') <= fare).count()]

# We convert it to pandas    
trip_fare = (pd.DataFrame(dicc_trip_fare).T).reset_index()
trip_fare.columns = ['fares_$','number_trips']
trip_fare

#### 4.4.4 Number of trips depending on the filter by toll

In [None]:
# We get the number of trips by tolls
tolls = [5,10,20,50,100]
dicc_trip_toll={}
for toll in tolls:
    dicc_trip_toll[str(toll)] = [taxi_trips.filter(F.col('tolls') <= toll).count()]

# We convert it to pandas    
trip_toll = (pd.DataFrame(dicc_trip_toll).T).reset_index()
trip_toll.columns = ['tolls_$','number_trips']
trip_toll

#### 4.4.5 Number of trips depending on the filter by extra

In [None]:
# We get the number of trips by extra
extras = [5,10,20,50,100,1000]
dicc_trip_extra={}
for extra in extras:
    dicc_trip_extra[str(extra)] = [taxi_trips.filter(F.col('extras') <= extra).count()]

# We convert it to pandas    
trip_extra = (pd.DataFrame(dicc_trip_extra).T).reset_index()
trip_extra.columns = ['extras_$','number_trips']
trip_extra

#### 4.4.6 Number of trips by payment type

In [None]:
# We get the number of trips by payment type and convert the result to pandas
payment_type = taxi_trips.groupby("payment_type").count().sort('count',ascending=False).toPandas()

# We rename the columns
payment_type.columns = ['payment_type','number_trips']
payment_type

### 4.5 Filter the trips

In [None]:
# We filter the dataframe after our study
print('Number of trips with outliers: ',taxi_trips.count())
taxi_trips = taxi_trips.filter((F.col('trip_seconds') <= 3600)& 
                  (F.col('trip_miles')<=150)&
                  (F.col('speed')<=120)&
                  (F.col('fare') <= 100) & 
                  (F.col('fare')>=3.25) &
                  (F.col('fare')>F.col('tips'))&
                  (F.col('tips') <= (0.4*F.col('fare')))&
                  (F.col('tolls')<=20)&
                  (F.col('extras')<=20))
print('Number of trips without outliers: ',taxi_trips.count())
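To make the combined rule easier to check by hand, here is a plain-Python sketch of the same conditions applied to a single trip held in a dictionary. The function name `is_plausible` is hypothetical, and the sample values are based on the first rows shown earlier. Rows with a missing value in any of the tested columns are rejected, which matches how a Spark `filter` drops rows whose condition evaluates to null.

```python
def is_plausible(trip: dict) -> bool:
    """Apply the notebook's filter limits to one trip (hypothetical helper)."""
    required = ("trip_seconds", "trip_miles", "speed", "fare", "tips", "tolls", "extras")
    if any(trip.get(name) is None for name in required):
        return False
    return (
        trip["trip_seconds"] <= 3600             # at most one hour
        and trip["trip_miles"] <= 150
        and trip["speed"] <= 120                 # miles per hour
        and 3.25 <= trip["fare"] <= 100
        and trip["fare"] > trip["tips"]
        and trip["tips"] <= 0.4 * trip["fare"]   # tip at most 40% of the fare
        and trip["tolls"] <= 20
        and trip["extras"] <= 20
    )


short_cash_trip = {"trip_seconds": 300, "trip_miles": 0.9, "speed": 10.8,
                   "fare": 5.65, "tips": 0.0, "tolls": 0.0, "extras": 0.0}
oversized_tip = dict(short_cash_trip, tips=20.0)

print(is_plausible(short_cash_trip))  # True
print(is_plausible(oversized_tip))    # False
```

Since the fare must be at least 3.25, any tip that is at most 40% of the fare is automatically smaller than the fare, so the `fare > tips` condition never rejects a trip on its own.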

In [None]:
taxi_trips.write.parquet("../Data/temp/taxi_df.parquet")

In [None]:
taxi_trips = session.read.parquet("../Data/temp/taxi_df.parquet")

### 4.6 Statistics after the filter

In [None]:
# We get the descriptive statistics of these variables again after the filtering
thresholds={}
for var in analysed_variables:
    print(var)
    describe = taxi_trips.select(F.round(F.mean(var), 2).alias("mean"),
                                F.min(var).alias("min"),
                                F.max(var).alias("max"),
                                F.round(F.stddev(var), 2).alias("stddv"))
    q1, median, q3 = taxi_trips.approxQuantile(var, [0.25, 0.5, 0.75], 0)
    iqr = q3 - q1
    describe = describe.withColumn("q1", F.lit(q1))
    describe = describe.withColumn("median", F.lit(median))
    describe = describe.withColumn("q3", F.lit(q3))
    describe = describe.withColumn("iqr", F.lit(iqr))
    describe = describe.withColumn("threshold", 1.5*F.lit(iqr))
    describe.show()

## 5 Exploratory data analysis

### 5.1 Analysis of trips, fares and tips by year, year-month, day of week and hour

We will analyze the number of trips, the fares and the tips by year, year-month, day of the week and hour.

#### 5.1.1 Year

In [None]:
df_trips_year = taxi_trips.groupby(F.year("trip_start_timestamp").alias("year"))\
                          .agg(F.count('taxi_id').alias('trips_number'),
                               F.mean('fare').alias('mean_fare'),
                               F.sum('tips').alias('total_tips'),                               
                               F.mean('tips').alias('tips_mean'))\
                          .sort('year').toPandas()
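The aggregation groups the trips by start year and computes, per group, the number of trips, the mean fare, the total tips and the mean tip. The plain-Python sketch below shows the same computation on a few hypothetical `(year, fare, tips)` rows, which makes the difference between a total and a mean easy to verify by hand; `summarize_by_year` is an illustrative name, not part of the notebook.

```python
from collections import defaultdict


def summarize_by_year(trips):
    """Group (year, fare, tips) rows by year and compute count, means and totals."""
    groups = defaultdict(list)
    for year, fare, tips in trips:
        groups[year].append((fare, tips))

    summary = {}
    for year in sorted(groups):
        rows = groups[year]
        n = len(rows)
        total_tips = sum(tips for _fare, tips in rows)
        summary[year] = {
            "trips_number": n,
            "mean_fare": sum(fare for fare, _tips in rows) / n,
            "total_tips": total_tips,
            "tips_mean": total_tips / n,
        }
    return summary


# Hypothetical rows: two trips in 2014 and one in 2015.
trips = [(2014, 10.0, 2.0), (2014, 20.0, 0.0), (2015, 8.0, 1.0)]
summary = summarize_by_year(trips)
print(summary[2014])
```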

##### 5.1.1.1 Number of trips by year

In [None]:
f, axes = plt.subplots(sharey=True, figsize=(10, 6))
sns.barplot(x=df_trips_year["year"], y=df_trips_year["trips_number"]).set_title('Trips by Year');

##### 5.1.1.2 Mean fare by year

In [None]:
f, axes = plt.subplots( sharey=True, figsize=(10, 6))
sns.barplot(x=df_trips_year["year"], y=df_trips_year['mean_fare']).set_title('Mean Fare by Year $');

##### 5.1.1.3 Total tips by year

In [None]:
f, axes = plt.subplots(sharey=True, figsize=(10, 6))
sns.barplot(x=df_trips_year["year"], y=df_trips_year["total_tips"]).set_title('Total Tips by Year $');

##### 5.1.1.4 Mean of tips per year

In [None]:
f, axes = plt.subplots(sharey=True, figsize=(10, 6))
sns.barplot(x=df_trips_year["year"], y=df_trips_year["tips_mean"]).set_title('Mean Tips by Year $');

#### 5.1.2 Year - Month

In [None]:
df_trips_year_month = taxi_trips.groupby(F.col('AUX_trip_start_timestamp')[0:7].alias('year_month'))\
                          .agg(F.count('taxi_id').alias('trips_number'),
                               F.mean('fare').alias('mean_fare'),
                               F.sum('tips').alias('total_tips'),                               
                               F.mean('tips').alias('tips_mean'))\
                          .sort('year_month').toPandas()

##### 5.1.2.1 Number of trips by year-month

In [None]:
f, axes = plt.subplots(sharey=True, figsize=(20, 8))
sns.barplot(x=df_trips_year_month["year_month"], 
            y=df_trips_year_month["trips_number"]).set_title('Trips by Year-Month')
axes.set_xticklabels(axes.get_xticklabels(), rotation = 60);

##### 5.1.2.2 Mean fare by year-month

In [None]:
f, axes = plt.subplots(sharey=True, figsize=(20, 8))
sns.barplot(x=df_trips_year_month["year_month"], 
            y=df_trips_year_month["mean_fare"]).set_title('Mean Fare by Year-Month')
axes.set_xticklabels(axes.get_xticklabels(), rotation = 60);

##### 5.1.2.3 Total tips by year-month

In [None]:
f, axes = plt.subplots(sharey=True, figsize=(20, 8))
sns.barplot(x=df_trips_year_month["year_month"], 
            y=df_trips_year_month["total_tips"]).set_title('Total Tips by Year-Month')
axes.set_xticklabels(axes.get_xticklabels(), rotation = 60);

##### 5.1.2.4 Mean tips by year-month

In [None]:
f, axes = plt.subplots(sharey=True, figsize=(20, 8))
sns.barplot(x=df_trips_year_month["year_month"], 
            y=df_trips_year_month["tips_mean"]).set_title('Mean Tips by Year-Month')
axes.set_xticklabels(axes.get_xticklabels(), rotation = 60);

#### 5.1.3 Day of week

In [None]:
# F.dayofweek returns 1 for Sunday through 7 for Saturday
df_trips_dayweek = taxi_trips.groupby(F.dayofweek("trip_start_timestamp").alias("dayweek"))\
                          .agg(F.count('taxi_id').alias('trips_number'),
                               F.mean('fare').alias('mean_fare'),
                               F.sum('tips').alias('total_tips'),
                               F.mean('tips').alias('tips_mean'))\
                          .sort('dayweek').toPandas()

##### 5.1.3.1 Number of trips by weekday

In [None]:
f, axes = plt.subplots(sharey=True, figsize=(10, 6))
sns.barplot(x=df_trips_dayweek["dayweek"], y=df_trips_dayweek["trips_number"]).set_title('Trips by Day of Week');

##### 5.1.3.2 Mean fare by day of week

In [None]:
f, axes = plt.subplots(sharey=True, figsize=(10, 6))
sns.barplot(x=df_trips_dayweek["dayweek"],
            y=df_trips_dayweek['mean_fare']).set_title('Mean Fare by Day of week $');

##### 5.1.3.3 Total tips by day of week

In [None]:
f, axes = plt.subplots(sharey=True, figsize=(10, 6))
sns.barplot(x=df_trips_dayweek["dayweek"],
            y=df_trips_dayweek['total_tips']).set_title('Total of Tips by Day of week $');

#### 5.1.4 Hour

In [None]:
df_trips_hour = taxi_trips.groupby(F.hour("trip_start_timestamp").alias("hour"))\
                          .agg(F.count('taxi_id').alias('trips_number'),
                               F.mean('fare').alias('mean_fare'),
                               F.sum('tips').alias('total_tips'),
                               F.mean('tips').alias('tips_mean'),
                               F.sum('tolls').alias('total_tolls'),
                               F.mean('tolls').alias('tolls_mean'))\
                          .sort('hour').toPandas()

#### 5.1.4.1 Total trips by hour

In [None]:
f, axes = plt.subplots(sharey=True, figsize=(10, 6))
sns.barplot(x=df_trips_hour["hour"], y=df_trips_hour["trips_number"]).set_title('Total Trips by Hour');

#### 5.1.4.2 Mean fare by hour

In [None]:
f, axes = plt.subplots(sharey=True, figsize=(10, 6))
sns.barplot(x=df_trips_hour["hour"], y=df_trips_hour["mean_fare"]).set_title('Mean Fare by Hour');

#### 5.1.4.3 Total tips by hour

In [None]:
f, axes = plt.subplots(sharey=True, figsize=(10, 6))
sns.barplot(x=df_trips_hour["hour"], y=df_trips_hour["total_tips"]).set_title('Total Tips by Hour');

#### 5.1.4.4 Mean tips by hour

In [None]:
f, axes = plt.subplots(sharey=True, figsize=(10, 6))
sns.barplot(x=df_trips_hour["hour"], y=df_trips_hour["tips_mean"]).set_title('Mean Tips by Hour');

#### 5.1.4.5 Total tolls by hour

In [None]:
f, axes = plt.subplots(sharey=True, figsize=(10, 6))
sns.barplot(x=df_trips_hour["hour"], y=df_trips_hour["total_tolls"]).set_title('Total Tolls by Hour');

#### 5.1.4.6 Mean tolls by hour

In [None]:
f, axes = plt.subplots(sharey=True, figsize=(10, 6))
sns.barplot(x=df_trips_hour["hour"], y=df_trips_hour["tolls_mean"]).set_title('Mean Tolls by Hour');

Between 5 and 7 in the morning the fewest trips are made, yet the average fare is at its highest. This suggests that these early trips tend to be longer, for example to or from the airport.

We also observe a decline in the number of taxi trips over time, which we attribute to the arrival of Uber.

### 5.2 Analysis of trips, fare, trip miles and trip duration by company

We will analyze trips, fare, trip miles and trip duration by company

In [None]:
df_company = taxi_trips.groupby(F.col('company'))\
                          .agg(F.count('taxi_id').alias('trips_number'),
                               F.sum('fare').alias('total_fare'),
                               F.mean('fare').alias('fare_mean'),                               
                               F.mean('tips').alias('tips_mean'),
                               F.mean('trip_miles').alias('miles_mean'),
                               (F.mean('trip_seconds') / 60).alias('duration_mean'))

#### 5.2.1 TOP 10 Total Trips by Company

In [None]:
company_trips = df_company.sort('trips_number',ascending=False).limit(10).toPandas()
company_trips

In [None]:
f, axes = plt.subplots(sharey=True, figsize=(10, 6))
sns.barplot(x=company_trips["company"],
            y=company_trips['trips_number']).set_title('TOP 10 Total Trips by Company')
axes.set_xticklabels(axes.get_xticklabels(), rotation = 60);

#### 5.2.2 TOP 10 Total Fare by Company

In [None]:
company_total_fare = df_company.sort('total_fare',ascending=False).limit(10).toPandas()
company_total_fare

In [None]:
f, axes = plt.subplots(sharey=True, figsize=(10, 6))
sns.barplot(x=company_total_fare["company"],
            y=company_total_fare['total_fare']).set_title('TOP 10 Total Fare by Company')
axes.set_xticklabels(axes.get_xticklabels(), rotation = 60);

#### 5.2.3 TOP 10 Mean Fare by Company

In [None]:
company_mean_fare = df_company.sort('fare_mean',ascending=False).limit(10).toPandas()
company_mean_fare

In [None]:
f, axes = plt.subplots(sharey=True, figsize=(10, 6))
sns.barplot(x=company_mean_fare["company"],
            y=company_mean_fare['fare_mean']).set_title('TOP 10 Mean Fare by Company')
axes.set_xticklabels(axes.get_xticklabels(), rotation = 60);

#### 5.2.4 TOP 10 Mean Tips by Company

In [None]:
company_mean_tips = df_company.sort('tips_mean',ascending=False).limit(10).toPandas()
company_mean_tips

In [None]:
f, axes = plt.subplots(sharey=True, figsize=(10, 6))
sns.barplot(x=company_mean_tips["company"],
            y=company_mean_tips['tips_mean']).set_title('TOP 10 Mean Tips by Company')
axes.set_xticklabels(axes.get_xticklabels(), rotation = 60);

#### 5.2.5 TOP 10 Longest Trips(miles) by Company

In [None]:
company_lenght_mean = df_company.sort('miles_mean',ascending=False).limit(10).toPandas()
company_lenght_mean

In [None]:
f, axes = plt.subplots(sharey=True, figsize=(10, 6))
sns.barplot(x=company_lenght_mean["company"],
            y=company_lenght_mean['miles_mean']).set_title('TOP 10 Longest Trips (miles) by Company')
axes.set_xticklabels(axes.get_xticklabels(), rotation = 60);

#### 5.2.6 TOP 10 Longest Trips(minutes) by Company

In [None]:
company_duration_mean = df_company.sort('duration_mean',ascending=False).limit(10).toPandas()
company_duration_mean

In [None]:
f, axes = plt.subplots(sharey=True, figsize=(10, 6))
sns.barplot(x=company_duration_mean["company"],
            y=company_duration_mean['duration_mean']).set_title('TOP 10 Longest Trips (minutes) by Company')
axes.set_xticklabels(axes.get_xticklabels(), rotation = 60);

Except for "Blue Ribbon Taxi Association Inc.", the companies that collect the most total fare, make the most trips, or make the longest trips are not among the 10 companies with the highest average fare.

### 5.3 Analysis of trips, fare, trip miles and trip duration by weather description

We will analyze trips, fare, trip miles and trip duration by weather description

In [None]:
df_weather_description = taxi_trips.groupby(F.col('weather_description'))\
                                   .agg(F.count('taxi_id').alias('trips_number'),
                                        F.sum('fare').alias('total_fare'),
                                        F.mean('fare').alias('fare_mean'),                               
                                        F.mean('tips').alias('tips_mean'),
                                        F.mean('trip_miles').alias('miles_mean'),
                                        (F.mean('trip_seconds') / 60).alias('duration_mean'))

#### 5.3.1 TOP 15 Total Trips by Weather Description

In [None]:
weather_total_trips = df_weather_description.sort('trips_number',ascending=False).limit(15).toPandas()
weather_total_trips

In [None]:
f, axes = plt.subplots(sharey=True, figsize=(10, 6))
sns.barplot(x=weather_total_trips["weather_description"],
            y=weather_total_trips['trips_number']).set_title('TOP 15 Total Trips by Weather Description')
axes.set_xticklabels(axes.get_xticklabels(), rotation = 60);

#### 5.3.2 TOP 15 Total Fare by Weather Description

In [None]:
weather_total_fares = df_weather_description.sort('total_fare',ascending=False).limit(15).toPandas()
weather_total_fares

In [None]:
f, axes = plt.subplots(sharey=True, figsize=(10, 6))
sns.barplot(x=weather_total_fares["weather_description"],
            y=weather_total_fares['total_fare']).set_title('TOP 15 Total Fares by Weather Description')
axes.set_xticklabels(axes.get_xticklabels(), rotation = 60);

#### 5.3.3 TOP 15 Mean Fare by Weather Description

In [None]:
weather_mean_fares = df_weather_description.sort('fare_mean',ascending=False).limit(15).toPandas()
weather_mean_fares

In [None]:
f, axes = plt.subplots(sharey=True, figsize=(10, 6))
sns.barplot(x=weather_mean_fares["weather_description"],
            y=weather_mean_fares['fare_mean']).set_title('TOP 15 Mean Fares by Weather Description')
axes.set_xticklabels(axes.get_xticklabels(), rotation = 75);

#### 5.3.4 TOP 15 Tips Mean by Weather Description

In [None]:
weather_mean_tips = df_weather_description.sort('tips_mean',ascending=False).limit(15).toPandas()
weather_mean_tips

In [None]:
f, axes = plt.subplots(sharey=True, figsize=(10, 6))
sns.barplot(x=weather_mean_tips["weather_description"],
            y=weather_mean_tips['tips_mean']).set_title('TOP 15 Mean Tips by Weather Description')
axes.set_xticklabels(axes.get_xticklabels(), rotation = 75);

#### 5.3.5 TOP 15 Longest Trips(miles) by Weather Description

In [None]:
weather_lenght_mean = df_weather_description.sort('miles_mean',ascending=False).limit(15).toPandas()
weather_lenght_mean

In [None]:
f, axes = plt.subplots(sharey=True, figsize=(10, 6))
sns.barplot(x=weather_lenght_mean["weather_description"],
            y=weather_lenght_mean['miles_mean']).set_title('TOP 15 Longest Trips (miles) by Weather Description')
axes.set_xticklabels(axes.get_xticklabels(), rotation = 75);

#### 5.3.6 TOP 15 Longest Trips(minutes) by Weather Description

In [None]:
weather_duration_mean = df_weather_description.sort('duration_mean',ascending=False).limit(15).toPandas()
weather_duration_mean

In [None]:
f, axes = plt.subplots(sharey=True, figsize=(10, 6))
sns.barplot(x=weather_duration_mean["weather_description"],
            y=weather_duration_mean['duration_mean']).set_title('TOP 15 Longest Trips (minutes) by Weather Description')
axes.set_xticklabels(axes.get_xticklabels(), rotation = 75);

We can observe that trips in good weather or light rain account for the largest number of trips and the largest total fare, whereas the highest average fares, the highest tips and the longest trips occur in extreme weather conditions.

## 6 Feature Engineering

In this part of the notebook we are going to create new features for dates and distances, and convert the taxi identifier column to an integer.

### 6.1 Dates

From column `trip_start_timestamp` we will get the fields:
- Year
- Month
- Day
- Week Day
- Hour
- Minute

#### 6.1.1 Get the year 

In [None]:
taxi_trips = taxi_trips.withColumn("year",
                                   F.year("trip_start_timestamp").cast(T.IntegerType()))

#### 6.1.2 Get the month 

In [None]:
taxi_trips = taxi_trips.withColumn("month",
                                   F.month("trip_start_timestamp").cast(T.IntegerType()))

#### 6.1.3 Get the day 

In [None]:
taxi_trips = taxi_trips.withColumn("day",
                                   F.dayofmonth("trip_start_timestamp").cast(T.IntegerType()))

#### 6.1.4 Get the week day 

In [None]:
taxi_trips = taxi_trips.withColumn("week_day",
                                   F.dayofweek("trip_start_timestamp").cast(T.IntegerType()))

#### 6.1.5 Get the hour 

In [None]:
taxi_trips = taxi_trips.withColumn("hour",
                                   F.hour("trip_start_timestamp").cast(T.IntegerType()))

#### 6.1.6 Get the minute 

In [None]:
taxi_trips = taxi_trips.withColumn("minute",
                                   F.minute("trip_start_timestamp").cast(T.IntegerType()))
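
As a rough cross-check of the six fields above, the sketch below derives the same parts from a plain Python `datetime`. It relies on Spark's documented convention that `dayofweek` numbers the days from 1 (Sunday) to 7 (Saturday), which differs from Python's `isoweekday()` (1 = Monday to 7 = Sunday). The helper name `date_parts` is hypothetical and not part of the notebook.

```python
from datetime import datetime


def date_parts(ts):
    """Return the date features of section 6.1 for a Python datetime.

    Hypothetical helper for illustration; the notebook computes these
    columns with pyspark.sql.functions instead.
    """
    return {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        # Spark's dayofweek: 1 = Sunday, ..., 7 = Saturday.
        # Python's isoweekday: 1 = Monday, ..., 7 = Sunday.
        "week_day": ts.isoweekday() % 7 + 1,
        "hour": ts.hour,
        "minute": ts.minute,
    }


# 2017-01-01 was a Sunday, so week_day is 1 under Spark's numbering.
parts = date_parts(datetime(2017, 1, 1, 5, 45))
print(parts)
```

Keeping this numbering in mind matters when reading the weekday plots: day 1 is Sunday, not Monday.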

### 6.2 Distance

We are going to create three distance fields. The first is the straight-line distance in miles between the `PickUp Point` and the `DropOff Point`.

The other two give the distance from `O'Hare International Airport` and from `Midway International Airport` to the trip: for each airport we keep the shorter of the distances to the pickup point and to the dropoff point.

#### 6.2.1 Get the distance between the pickup and dropoff points  

In [None]:
# We define the function to get the distance in miles
def distance(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon):

    # Earth circumference in miles
    equator_circumference = 24901.461
    poles_circumference = 24859.73    
    
    # Compute distances along lat, lon dimensions
    lat_distance = ((dropoff_lat - pickup_lat)*poles_circumference)/360
    long_distance = ((dropoff_lon - pickup_lon)*equator_circumference)/360
    
    #Compute distance in a straight line
    distance = (lat_distance**2 + long_distance**2)**0.5 
    
    return distance

In [None]:
# We get the udf
distance_udf = F.udf(distance, T.FloatType())
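
The `distance` function above treats latitude and longitude as a flat grid and scales every degree of longitude by the equatorial circumference. Because a degree of longitude shrinks with the cosine of the latitude, this overstates east-west distances around Chicago (latitude about 41.9°). A minimal sketch of one alternative, the haversine great-circle formula, is given here for comparison. The function names `haversine_miles` and `flat_miles` and the mean Earth radius of 3958.8 miles are assumptions of this sketch, not part of the notebook.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius (assumed value for this sketch)


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def flat_miles(lat1, lon1, lat2, lon2):
    """Same arithmetic as the notebook's `distance` function, for comparison."""
    lat_distance = (lat2 - lat1) * 24859.73 / 360
    lon_distance = (lon2 - lon1) * 24901.461 / 360
    return (lat_distance ** 2 + lon_distance ** 2) ** 0.5


# Airport coordinates as used in section 6.2.2
ord_coord = (41.978611, -87.904722)
mdw_coord = (41.786111, -87.7525)

print(haversine_miles(*ord_coord, *mdw_coord))
print(flat_miles(*ord_coord, *mdw_coord))
```

For these two airports the flat estimate is the larger of the two. If the haversine version were preferred, it could be wrapped with `F.udf(haversine_miles, T.FloatType())`, as is done for `distance`.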

In [None]:
# We get the new computed column
taxi_trips = taxi_trips.withColumn("distance_miles",
                                   distance_udf("pickup_centroid_latitude",
                                                "pickup_centroid_longitude",
                                                "dropoff_centroid_latitude",
                                                "dropoff_centroid_longitude"))

#### 6.2.2 Get the distance between the pickup and dropoff points with the airports of Chicago

In [None]:
"""
ORD: O'Hare International Airport
MDW: Midway International Airport
"""
ord_coord = [41.978611, -87.904722]
mdw_coord = [41.786111, -87.7525]

In [None]:
# We define the function to get the shortest distance in miles to the airports
def distance_airport(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon,airport_lat,airport_lon):
    
    pickup_airport = distance(pickup_lat, pickup_lon, airport_lat, airport_lon) 
    airport_dropoff = distance(airport_lat, airport_lon, dropoff_lat, dropoff_lon) 
    return min(pickup_airport, airport_dropoff)

# We get the udf
distance_airport_udf = F.udf(distance_airport, T.FloatType())
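
Spark passes SQL `NULL` values to a Python UDF as `None`, and arithmetic on `None` raises a `TypeError`. If any centroid column can still contain nulls at this point (an assumption; earlier filters may already have removed them), a guarded variant along these lines may be safer. The names `safe_distance_airport` and `flat_distance` are hypothetical.

```python
def flat_distance(lat1, lon1, lat2, lon2):
    """Same arithmetic as the notebook's `distance` function."""
    lat_distance = (lat2 - lat1) * 24859.73 / 360
    lon_distance = (lon2 - lon1) * 24901.461 / 360
    return (lat_distance ** 2 + lon_distance ** 2) ** 0.5


def safe_distance_airport(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon,
                          airport_lat, airport_lon):
    """Shortest distance from either trip endpoint to the airport.

    Returns None when any coordinate is missing, so the resulting Spark
    column would hold NULL instead of the UDF failing.
    """
    coords = (pickup_lat, pickup_lon, dropoff_lat, dropoff_lon,
              airport_lat, airport_lon)
    if any(value is None for value in coords):
        return None
    pickup_airport = flat_distance(pickup_lat, pickup_lon, airport_lat, airport_lon)
    airport_dropoff = flat_distance(airport_lat, airport_lon, dropoff_lat, dropoff_lon)
    return min(pickup_airport, airport_dropoff)


ord_coord = [41.978611, -87.904722]

# A trip that ends exactly at O'Hare is at distance 0 from the airport.
at_airport = safe_distance_airport(41.8755616, -87.6244212,
                                   ord_coord[0], ord_coord[1], *ord_coord)
# A trip with a missing dropoff centroid yields None rather than an error.
missing = safe_distance_airport(41.8755616, -87.6244212, None, None, *ord_coord)
print(at_airport, missing)
```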

##### 6.2.2.1 Get the distance between the pickup and dropoff points with the O'Hare International Airport

In [None]:
taxi_trips = taxi_trips.withColumn("distance_ord",
                                   distance_airport_udf("pickup_centroid_latitude",
                                                        "pickup_centroid_longitude",
                                                        "dropoff_centroid_latitude",
                                                        "dropoff_centroid_longitude",
                                                        F.lit(ord_coord[0]),
                                                        F.lit(ord_coord[1])))

##### 6.2.2.2 Get the distance between the pickup and dropoff points with the Midway International Airport

In [None]:
taxi_trips = taxi_trips.withColumn("distance_mdw",
                                   distance_airport_udf("pickup_centroid_latitude",
                                                        "pickup_centroid_longitude",
                                                        "dropoff_centroid_latitude",
                                                        "dropoff_centroid_longitude",
                                                        F.lit(mdw_coord[0]),
                                                        F.lit(mdw_coord[1])))

In [None]:
taxi_trips.write.parquet("../Data/temp/taxi_model.parquet")

In [None]:
taxi_trips = session.read.parquet("../Data/temp/taxi_model.parquet")

### 6.3 Convert string variables to integer variables

We are going to encode the string columns as integer indices.

#### 6.3.1 Dataset we are going to use for visualization

In [None]:
# We drop the columns that do not interest us for the visualization
taxi_visualized = taxi_trips.drop('trip_end_timestamp',
                                  'year',
                                  'month',
                                  'day',
                                  'week_day',
                                  'hour',
                                  'minute',
                                  'AUX_trip_start_timestamp',
                                  'speed')

In [None]:
# We convert the taxi id string to integer
enc_columns=['taxi_id']
for column in enc_columns:
    feauture_indexer = StringIndexer(inputCol=column,outputCol=column+"_ind")
    indexer_model = feauture_indexer.fit(taxi_visualized)
    taxi_visualized = indexer_model.transform(taxi_visualized)
    taxi_visualized=taxi_visualized.withColumn(column+"_ind",F.col(column+"_ind").cast(T.IntegerType()))

In [None]:
# We define the function to change the csv names
def change_file_name(path,extension='.csv'):
    result = os.listdir(path)
    file = list(filter(lambda x: x.endswith(extension), result))
    old_path = path + file[0]
    new_path = path[:-1] + extension
    return os.rename(old_path, new_path)
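
Spark writes a CSV "file" as a directory that holds one `part-*.csv` file per partition plus marker files such as `_SUCCESS`, which is why a helper like the one above is needed after `repartition(1)`. Note that `change_file_name` relies on `path` ending with a slash. The sketch below is a variant based on `pathlib` that tolerates a missing trailing slash; `promote_single_csv` is a hypothetical name, and the temporary directory only simulates Spark's output layout.

```python
import tempfile
from pathlib import Path


def promote_single_csv(folder, extension=".csv"):
    """Move the single part file inside `folder` to `<folder><extension>`.

    Hypothetical variant of `change_file_name`; it works with or without
    a trailing slash and fails loudly unless exactly one file matches.
    """
    folder = Path(folder)
    matches = sorted(p for p in folder.iterdir() if p.name.endswith(extension))
    if len(matches) != 1:
        raise ValueError(f"expected one {extension} file, found {len(matches)}")
    target = folder.with_name(folder.name + extension)
    matches[0].rename(target)
    return target


# Simulate the layout Spark produces for a one-partition CSV write.
base = Path(tempfile.mkdtemp())
out_dir = base / "taxi_visualized"
out_dir.mkdir()
(out_dir / "part-00000.csv").write_text("fare,tips\n10.0,2.0\n")
(out_dir / "_SUCCESS").write_text("")

result = promote_single_csv(str(out_dir) + "/")
print(result.name)
```

As in the notebook, the now redundant output folder can be removed with `shutil.rmtree` once the file has been moved.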

In [None]:
# We write the dataset that we are going to use for the visualization
taxi_visualized.drop('taxi_id')\
               .repartition(1).write.save("../Data/taxi_visualized",format="csv",header=True)

In [None]:
# We rename the csv and delete the output folder
change_file_name('../Data/taxi_visualized/')
shutil.rmtree('../Data/taxi_visualized')

In [None]:
# We write a 0.1% sample of the visualization dataset. We will build our Tableau with the sample
# and then switch to the complete dataset.
taxi_visualized.sample(0.001).drop('taxi_id')\
               .repartition(1).write.save("../Data/taxi_visualized_sample",format="csv",header=True)

In [None]:
# We rename the csv and delete the output folder
change_file_name('../Data/taxi_visualized_sample/')
shutil.rmtree('../Data/taxi_visualized_sample')

#### 6.3.2 Dataset we are going to use to build the model

In [None]:
enc_columns=['taxi_id','payment_type','company','weather_description']

In [None]:
# We convert the enc_columns from string to integer
for column in enc_columns:
    feauture_indexer = StringIndexer(inputCol=column,outputCol=column+"_ind")
    indexer_model = feauture_indexer.fit(taxi_trips)
    taxi_trips = indexer_model.transform(taxi_trips)
    taxi_trips = taxi_trips.withColumn(column+"_ind",F.col(column+"_ind").cast(T.IntegerType()))
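
By default `StringIndexer` orders labels by descending frequency, so the most frequent value receives index 0. The resulting integers are therefore arbitrary category codes rather than ordered quantities. The pure-Python sketch below imitates that default ordering on a toy column; the function name `frequency_index` and the sample values are made up for illustration, and ties are broken alphabetically here as an assumption.

```python
from collections import Counter


def frequency_index(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending count, then alphabetically to make ties deterministic.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


payment_types = ["Cash", "Credit Card", "Cash", "Cash", "Credit Card", "Unknown"]
mapping = frequency_index(payment_types)
encoded = [mapping[value] for value in payment_types]
print(mapping)
print(encoded)
```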

### 6.4 Delete the variables that we are not going to use to model

In [None]:
taxi_trips=taxi_trips.drop('trip_end_timestamp',
                           'trip_miles',
                           'tips',
                           'tolls',
                           'extras',
                           'trip_total',
                           'AUX_trip_start_timestamp',
                           'speed')

## 7 Correlation Matrix

In this section we are going to build the correlation matrix to get an idea of which variables are most strongly related to the fare.

### 7.1 Get the variables to be used to construct the matrix

In [None]:
# We get the numerical variables to construct the matrix correlation
numeric_vars = []
for col, tipo in taxi_trips.dtypes:
    if tipo=="int" or tipo=="float" or tipo=="double" :
        numeric_vars.append(col)

### 7.2 Construct the correlation matrix

In [None]:
numeric_df = taxi_trips.select(numeric_vars)
corr_matrix = Statistics.corr(numeric_df.rdd.map(lambda x: Vectors.dense(x)))

In [None]:
corr_matrix = pd.DataFrame(corr_matrix, 
                           columns = numeric_vars, 
                           index = numeric_vars)

In [None]:
mask = np.zeros_like(corr_matrix, dtype=bool)
index_mask = np.triu_indices_from(mask)
mask[index_mask] = True

In [None]:
corr_matrix = corr_matrix.mask(mask)
corr_matrix
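
`Statistics.corr` computes the Pearson correlation when no method is given, and the mask above hides the upper triangle and the diagonal so that each pair of variables appears only once. A minimal pure-Python sketch of both steps on toy data follows; the column values are invented and the helper names `pearson` and `lower_triangle_corr` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def lower_triangle_corr(columns):
    """Return {(row, col): r} for pairs strictly below the diagonal."""
    names = list(columns)
    return {
        (names[i], names[j]): pearson(columns[names[i]], columns[names[j]])
        for i in range(len(names))
        for j in range(i)
    }


# Toy data (invented): fare grows with distance and falls with `discount`.
toy = {
    "fare": [5.0, 10.0, 15.0, 20.0],
    "distance_miles": [1.0, 2.0, 3.0, 4.0],
    "discount": [4.0, 3.0, 2.0, 1.0],
}
pairs = lower_triangle_corr(toy)
for pair, r in pairs.items():
    print(pair, round(r, 3))
```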

### 7.3 Plot the correlation matrix

In [None]:
# Generate a mask for the upper triangle
mask = np.zeros_like(corr_matrix, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(15, 15))

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr_matrix, mask=mask, annot= True, fmt= '.1f',cmap='PiYG', vmin= -1, vmax=1,center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5});

## 8 Get the final dataset to model

### 8.1 Prepare the dataset we are going to use to model

In [None]:
taxi_trips.columns

In [None]:
# We sort the dataset by date
taxi_trips=taxi_trips.sort('trip_start_timestamp')

In [None]:
# We delete the columns that are not useful for our model
taxi_trips = taxi_trips.drop('taxi_id',
                             'trip_start_timestamp',
                             'taxi_id_ind',
                             'payment_type_ind',
                             'company_ind',
                             'weather_description_ind')

In [None]:
taxi_trips.write.parquet("../Data/temp/taxi_model_final.parquet")

In [None]:
taxi_trips = session.read.parquet("../Data/temp/taxi_model_final.parquet")

In [None]:
taxi_trips = taxi_trips.filter(F.col('company')=='Taxi Affiliation Services')

### 8.2 Save the full dataset

In [None]:
# We save the dataset we are going to use to model
taxi_trips.repartition(1).write.save("../Data/taxi_model",
                                                      format="csv",header=True)

In [None]:
# We rename the csv and delete the output folder
change_file_name('../Data/taxi_model/')
shutil.rmtree('../Data/taxi_model')

In [None]:
# We save a 1% sample to build and tune our model before using the complete dataset
taxi_trips.sample(0.01,seed=13).repartition(1).write.save("../Data/taxi_model_sample001",
                                                      format="csv",header=True)

In [None]:
# We rename the csv and delete the output folder
change_file_name('../Data/taxi_model_sample001/')
shutil.rmtree('../Data/taxi_model_sample001')
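
`DataFrame.sample(0.01, seed=13)` keeps each row independently with probability 0.01, so the sample holds approximately, not exactly, 1% of the rows, while the fixed seed is meant to make repeated runs on the same data select the same rows. The pure-Python sketch below illustrates this behaviour; `bernoulli_sample` is a hypothetical helper and does not reproduce Spark's random number generator.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(100_000))
sample_a = bernoulli_sample(rows, 0.01, seed=13)
sample_b = bernoulli_sample(rows, 0.01, seed=13)

print(len(sample_a))         # close to 1000, but not guaranteed to equal it
print(sample_a == sample_b)  # compares two runs that used the same seed
```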

## 9 Conclusion

Finally, we have created the csv `taxi_model_sample001.csv`, which is a 1% sample of the entire dataset. We will use this sample to choose the model and tune its hyperparameters. Once the hyperparameters are set, we will train on the complete dataset `taxi_model.csv` with the method that gives the best results.

For the visualization we follow the same approach: we will build our Tableau with the sample `taxi_visualized_sample.csv` and, once it shows what we want, load the complete csv `taxi_visualized.csv`.

## 10 Delete the temporary folder

In [None]:
# We delete the temporary folder with the parquet files
shutil.rmtree('../Data/temp')