# Raw Analytics - Uber & Ola Ride Booking & Cancellation Data

---

Raw-data analysis stage, focused on exploratory data analysis (EDA).
This stage aims to understand the structure, behavior, and quality of the data, which will guide the transformation, cleaning, and enrichment steps that follow.

The main objectives of this exploratory analysis are:

* Validation and Integrity
* Data Profiling and Categorization
* Data Quality Assessment
* Outlier Identification and Visualization
* Anomaly Detection

# Libraries

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

import pandas as pd
import numpy as np

import matplotlib as mpl
import matplotlib.pyplot as plt

import seaborn as sns
import seaborn.objects as so

# Starting the PySpark Session

In [2]:
spark = SparkSession.builder.appName("raw").getOrCreate()
spark

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
26/01/17 23:47:46 WARN Utils: Your hostname, CyberCore, resolves to a loopback address: 127.0.1.1; using 192.168.1.3 instead (on interface wlp0s20f3)
26/01/17 23:47:46 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
26/01/17 23:47:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


# Data Loading and Initial Inspection

In [None]:
df_raw = spark.read.csv("dados_brutos.csv", header=True, inferSchema=True, nullValue="null", emptyValue=None, nanValue="nan")
df_raw.show(n=10)

+-------------------+--------+-------------+--------------------+-----------+------------+-----------------+-------------+-----+-----+--------------------------+------------------------+----------------+-----------------------+-------------+--------------+-------------+--------------+---------------+--------------+----+
|               Date|    Time|   Booking_ID|      Booking_Status|Customer_ID|Vehicle_Type|  Pickup_Location|Drop_Location|V_TAT|C_TAT|Canceled_Rides_by_Customer|Canceled_Rides_by_Driver|Incomplete_Rides|Incomplete_Rides_Reason|Booking_Value|Payment_Method|Ride_Distance|Driver_Ratings|Customer_Rating|Vehicle Images|_c20|
+-------------------+--------+-------------+--------------------+-----------+------------+-----------------+-------------+-----+-----+--------------------------+------------------------+----------------+-----------------------+-------------+--------------+-------------+--------------+---------------+--------------+----+
|2024-07-26 14:00:00|14:00:00|CNR7

26/01/18 00:04:16 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: Date, Time, Booking_ID, Booking_Status, Customer_ID, Vehicle_Type, Pickup_Location, Drop_Location, V_TAT, C_TAT, Canceled_Rides_by_Customer, Canceled_Rides_by_Driver, Incomplete_Rides, Incomplete_Rides_Reason, Booking_Value, Payment_Method, Ride_Distance, Driver_Ratings, Customer_Rating, Vehicle Images, null
 Schema: Date, Time, Booking_ID, Booking_Status, Customer_ID, Vehicle_Type, Pickup_Location, Drop_Location, V_TAT, C_TAT, Canceled_Rides_by_Customer, Canceled_Rides_by_Driver, Incomplete_Rides, Incomplete_Rides_Reason, Booking_Value, Payment_Method, Ride_Distance, Driver_Ratings, Customer_Rating, Vehicle Images, _c20
Expected: _c20 but found: null
CSV file: file:///home/daniel-barros/Documentos/Faculdade/Sistemas%20de%20Banco%20de%20Dados%202%20-%20Thiago%20Luiz/car_rides_analytics/Data%20Layer/raw/dados_brutos.csv


# Validation and Integrity

In [18]:
print(f"\n{'='*10} COLUMNS AND THEIR STRUCTURAL INFORMATION {'='*10}\n")
df_raw.printSchema()



root
 |-- Date: string (nullable = true)
 |-- Time: string (nullable = true)
 |-- Booking_ID: string (nullable = true)
 |-- Booking_Status: string (nullable = true)
 |-- Customer_ID: string (nullable = true)
 |-- Vehicle_Type: string (nullable = true)
 |-- Pickup_Location: string (nullable = true)
 |-- Drop_Location: string (nullable = true)
 |-- V_TAT: string (nullable = true)
 |-- C_TAT: string (nullable = true)
 |-- Canceled_Rides_by_Customer: string (nullable = true)
 |-- Canceled_Rides_by_Driver: string (nullable = true)
 |-- Incomplete_Rides: string (nullable = true)
 |-- Incomplete_Rides_Reason: string (nullable = true)
 |-- Booking_Value: string (nullable = true)
 |-- Payment_Method: string (nullable = true)
 |-- Ride_Distance: string (nullable = true)
 |-- Driver_Ratings: string (nullable = true)
 |-- Customer_Rating: string (nullable = true)
 |-- Vehicle Images: string (nullable = true)
 |-- _c20: string (nullable = true)



In [19]:
print(f"\n{'='*10} BASIC COLUMN STATISTICS {'='*10}\n")

df_raw.summary().show()





+-------+------------------+--------+-------------+--------------------+-----------+------------+---------------+-------------+------------------+------------------+--------------------------+------------------------+----------------+-----------------------+-----------------+--------------+------------------+------------------+------------------+--------------+----+
|summary|              Date|    Time|   Booking_ID|      Booking_Status|Customer_ID|Vehicle_Type|Pickup_Location|Drop_Location|             V_TAT|             C_TAT|Canceled_Rides_by_Customer|Canceled_Rides_by_Driver|Incomplete_Rides|Incomplete_Rides_Reason|    Booking_Value|Payment_Method|     Ride_Distance|    Driver_Ratings|   Customer_Rating|Vehicle Images|_c20|
+-------+------------------+--------+-------------+--------------------+-----------+------------+---------------+-------------+------------------+------------------+--------------------------+------------------------+----------------+-----------------------+----

                                                                                

# Validation and Integrity

The raw CSV file has a trailing comma at the end of every line, including the header. As a result, Spark created an extra column, which it named `_c20`, that holds no data at all, only nulls. This column must be removed during data treatment.
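
To make the effect of the trailing comma concrete, here is a minimal sketch using only Python's standard `csv` module on a tiny made-up sample (the `raw` string and its two rows are assumptions for illustration, not the real `dados_brutos.csv`). It shows that a trailing comma yields one extra, unnamed field per line; Spark labels such an unnamed header field by its zero-based position, which is where `_c20` comes from.

```python
import csv
import io

# Hypothetical toy sample: every line, including the header,
# ends with a trailing comma, as in the raw file described above.
raw = "Booking_ID,Vehicle Images,\nB1,#NAME?,\nB2,#NAME?,\n"

rows = list(csv.reader(io.StringIO(raw)))
header, records = rows[0], rows[1:]

# The trailing comma produces one extra empty field at the end of each row.
print(header)      # ['Booking_ID', 'Vehicle Images', '']
print(records[0])  # ['B1', '#NAME?', '']

# Positions of header fields that have no name: candidates for removal.
unnamed = [i for i, name in enumerate(header) if name == ""]

# Keep only the named fields in every row (header included).
cleaned = [[v for i, v in enumerate(row) if i not in unnamed] for row in rows]
print(cleaned[0])  # ['Booking_ID', 'Vehicle Images']
```

In the notebook itself the equivalent cleanup would presumably be a column drop on the Spark DataFrame during the treatment stage; the sketch only illustrates why the empty column appears.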

One column, `Vehicle Images`, has no useful value for the analysis: as the outputs in this section show, it carries a single value, `#NAME?`. Even if it held other values, images would not be useful for the intended analysis, so this column must also be dropped during data treatment.

In [6]:
df_raw.describe(["Vehicle Images", "_c20"]).show()

26/01/17 23:47:56 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: Vehicle Images, null
 Schema: Vehicle Images, _c20
Expected: _c20 but found: null
CSV file: file:///home/daniel-barros/Documentos/Faculdade/Sistemas%20de%20Banco%20de%20Dados%202%20-%20Thiago%20Luiz/car_rides_analytics/Data%20Layer/raw/dados_brutos.csv


+-------+--------------+----+
|summary|Vehicle Images|_c20|
+-------+--------------+----+
|  count|        103024|   0|
|   mean|          NULL|NULL|
| stddev|          NULL|NULL|
|    min|        #NAME?|NULL|
|    max|        #NAME?|NULL|
+-------+--------------+----+



In [7]:
df_raw.select(["Vehicle Images", "_c20"]).distinct().show()
df_raw.groupBy(["Vehicle Images", "_c20"]).count().orderBy(F.desc("count")).show()


+--------------+----+
|Vehicle Images|_c20|
+--------------+----+
|        #NAME?|NULL|
+--------------+----+




+--------------+----+------+
|Vehicle Images|_c20| count|
+--------------+----+------+
|        #NAME?|NULL|103024|
+--------------+----+------+
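
The check above targets two columns that were already suspected. As a rough sketch of how the same idea could be generalized, the plain-Python helper below flags every column with at most one distinct non-null value. The function name `low_information_columns` and the `sample` rows are hypothetical and exist only for illustration; on the real dataset the same logic would have to be expressed with Spark aggregations rather than Python loops.

```python
def low_information_columns(rows):
    """Map each column with at most one distinct non-null value to that set of values."""
    columns = list(rows[0].keys()) if rows else []
    flagged = {}
    for col in columns:
        # Collect the distinct non-null values seen in this column.
        distinct = {row[col] for row in rows if row[col] is not None}
        if len(distinct) <= 1:
            flagged[col] = distinct
    return flagged


# Hypothetical sample mirroring the pattern observed in the outputs above.
sample = [
    {"Booking_ID": "B1", "Vehicle Images": "#NAME?", "_c20": None},
    {"Booking_ID": "B2", "Vehicle Images": "#NAME?", "_c20": None},
]

flagged = low_information_columns(sample)
print(flagged)  # {'Vehicle Images': {'#NAME?'}, '_c20': set()}
```

An empty set marks a column that is entirely null (like `_c20`), while a one-element set marks a constant column (like `Vehicle Images`). Both are reasonable candidates for removal, though a constant column should still be inspected before it is dropped.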



Missing-Data Check per Column