## Data analysis: Brazilian e-commerce

This project uses a set of public Brazilian e-commerce datasets made available by Olist. The records cover the entire sales process for a product (purchase, payment, delivery, and review), along with geolocation, product, and seller data. This information will be cleaned and analyzed in order to answer business questions.


### Importing libraries

In [74]:
from pyspark.sql import SparkSession


### Creating and starting a Spark session

In [75]:
spark = SparkSession.builder.appName('PySpark - Olist').getOrCreate()
spark


### Creating the datasets by reading the *.csv files


In [76]:
# Forward slashes avoid invalid escape sequences such as '\o' and work on every OS.
df_orders = spark.read.csv('dados/olist_orders_dataset.csv', sep=',', header=True, encoding='utf-8', inferSchema=True)
df_customers = spark.read.csv('dados/olist_customers_dataset.csv', sep=',', header=True, encoding='utf-8', inferSchema=True)
df_geolocation = spark.read.csv('dados/olist_geolocation_dataset.csv', sep=',', header=True, encoding='utf-8', inferSchema=True)
df_order_items = spark.read.csv('dados/olist_order_items_dataset.csv', sep=',', header=True, encoding='utf-8', inferSchema=True)
df_order_payments = spark.read.csv('dados/olist_order_payments_dataset.csv', sep=',', header=True, encoding='utf-8', inferSchema=True)
df_order_reviews = spark.read.csv('dados/olist_order_reviews_dataset.csv', sep=',', header=True, encoding='utf-8', inferSchema=True)
df_products = spark.read.csv('dados/olist_products_dataset.csv', sep=',', header=True, encoding='utf-8', inferSchema=True)
df_sellers = spark.read.csv('dados/olist_sellers_dataset.csv', sep=',', header=True, encoding='utf-8', inferSchema=True)
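
The eight reads above share the same options and differ only in the file name, so they could also be expressed as a loop. The following is a minimal sketch, not the notebook's own method: `csv_path`, `load_tables`, `TABLES`, and `DATA_DIR` are hypothetical names introduced here, and the file naming pattern `olist_<table>_dataset.csv` is taken from the paths used above.

```python
from pathlib import PurePosixPath

# Directory name taken from the notebook; adjust it if the files live elsewhere.
DATA_DIR = "dados"

# Table names matching the notebook's df_* variables.
TABLES = [
    "orders", "customers", "geolocation", "order_items",
    "order_payments", "order_reviews", "products", "sellers",
]


def csv_path(table, data_dir=DATA_DIR):
    """Build '<data_dir>/olist_<table>_dataset.csv' using forward slashes."""
    return str(PurePosixPath(data_dir) / f"olist_{table}_dataset.csv")


def load_tables(spark, data_dir=DATA_DIR):
    """Read every CSV with the same options; returns {table name: DataFrame}."""
    return {
        table: spark.read.csv(
            csv_path(table, data_dir),
            sep=",", header=True, encoding="utf-8", inferSchema=True,
        )
        for table in TABLES
    }
```

With an active session, `dfs = load_tables(spark)` would give `dfs['orders']`, `dfs['customers']`, and so on, in place of the individual `df_*` variables.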

### Checking the column types

In [77]:
df_orders.printSchema()

root
 |-- order_id: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- order_status: string (nullable = true)
 |-- order_purchase_timestamp: timestamp (nullable = true)
 |-- order_approved_at: timestamp (nullable = true)
 |-- order_delivered_carrier_date: timestamp (nullable = true)
 |-- order_delivered_customer_date: timestamp (nullable = true)
 |-- order_estimated_delivery_date: timestamp (nullable = true)
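
Because `inferSchema=True` makes Spark scan the file an extra time to guess the types, the schema printed above could instead be declared up front. This is only a sketch under assumptions: `ORDERS_SCHEMA` and `read_orders` are hypothetical names, the column types simply mirror the inferred output, and the timestamps are assumed to be in a format Spark parses by default (which the inference result suggests).

```python
# DDL-style schema string mirroring the types Spark inferred for df_orders.
ORDERS_SCHEMA = ", ".join([
    "order_id STRING",
    "customer_id STRING",
    "order_status STRING",
    "order_purchase_timestamp TIMESTAMP",
    "order_approved_at TIMESTAMP",
    "order_delivered_carrier_date TIMESTAMP",
    "order_delivered_customer_date TIMESTAMP",
    "order_estimated_delivery_date TIMESTAMP",
])


def read_orders(spark, path="dados/olist_orders_dataset.csv"):
    """Read the orders CSV with an explicit schema instead of inferSchema."""
    return spark.read.csv(
        path, sep=",", header=True, encoding="utf-8", schema=ORDERS_SCHEMA,
    )
```

An explicit schema also keeps the column types stable if a later version of the file contains values that would change what inference produces.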



In [78]:
df_customers.printSchema()

root
 |-- customer_id: string (nullable = true)
 |-- customer_unique_id: string (nullable = true)
 |-- customer_zip_code_prefix: integer (nullable = true)
 |-- customer_city: string (nullable = true)
 |-- customer_state: string (nullable = true)



In [79]:
df_geolocation.printSchema()

root
 |-- geolocation_zip_code_prefix: integer (nullable = true)
 |-- geolocation_lat: double (nullable = true)
 |-- geolocation_lng: double (nullable = true)
 |-- geolocation_city: string (nullable = true)
 |-- geolocation_state: string (nullable = true)



In [80]:
df_order_items.printSchema()

root
 |-- order_id: string (nullable = true)
 |-- order_item_id: integer (nullable = true)
 |-- product_id: string (nullable = true)
 |-- seller_id: string (nullable = true)
 |-- shipping_limit_date: timestamp (nullable = true)
 |-- price: double (nullable = true)
 |-- freight_value: double (nullable = true)



In [81]:
df_order_payments.printSchema()

root
 |-- order_id: string (nullable = true)
 |-- payment_sequential: integer (nullable = true)
 |-- payment_type: string (nullable = true)
 |-- payment_installments: integer (nullable = true)
 |-- payment_value: double (nullable = true)



In [82]:
df_order_reviews.printSchema()

root
 |-- review_id: string (nullable = true)
 |-- order_id: string (nullable = true)
 |-- review_score: string (nullable = true)
 |-- review_comment_title: string (nullable = true)
 |-- review_comment_message: string (nullable = true)
 |-- review_creation_date: string (nullable = true)
 |-- review_answer_timestamp: string (nullable = true)



In [83]:
df_products.printSchema()

root
 |-- product_id: string (nullable = true)
 |-- product_category_name: string (nullable = true)
 |-- product_name_lenght: integer (nullable = true)
 |-- product_description_lenght: integer (nullable = true)
 |-- product_photos_qty: integer (nullable = true)
 |-- product_weight_g: integer (nullable = true)
 |-- product_length_cm: integer (nullable = true)
 |-- product_height_cm: integer (nullable = true)
 |-- product_width_cm: integer (nullable = true)



In [84]:
df_sellers.printSchema()

root
 |-- seller_id: string (nullable = true)
 |-- seller_zip_code_prefix: integer (nullable = true)
 |-- seller_city: string (nullable = true)
 |-- seller_state: string (nullable = true)



### Checking for null records

In [85]:
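def check_nulls(dataframe, name):
    """Print the number of null values in each column of `dataframe`."""
    print('-'*100)
    print(name.upper(), '\n')
    for coluna in dataframe.columns:
        # Each filter + count runs as a separate Spark job.
        null_count = dataframe.filter(dataframe[coluna].isNull()).count()
        print(coluna, null_count)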
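
`check_nulls` calls `count()` once per column, so each column is counted in its own Spark job. As an alternative sketch, all null counts can be computed in a single aggregation by passing one SQL expression per column to `selectExpr`. The helper names `null_count_exprs` and `count_nulls_single_pass` are hypothetical and not part of the original notebook.

```python
def null_count_exprs(columns):
    """Return one SQL aggregate per column that counts its NULL values."""
    return [
        f"SUM(CASE WHEN `{c}` IS NULL THEN 1 ELSE 0 END) AS `{c}`"
        for c in columns
    ]


def count_nulls_single_pass(dataframe):
    """Return a one-row DataFrame holding the null count of every column."""
    return dataframe.selectExpr(*null_count_exprs(dataframe.columns))
```

For example, `count_nulls_single_pass(df_orders).show()` would print one row with a null count per column, which should match the per-column figures printed by `check_nulls`.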


In [88]:
check_nulls(df_orders, 'df_orders')
check_nulls(df_customers, 'df_customers')
check_nulls(df_geolocation, 'df_geolocation')
check_nulls(df_order_items, 'df_order_items')
check_nulls(df_order_payments, 'df_order_payments')
check_nulls(df_order_reviews, 'df_order_reviews')
check_nulls(df_products, 'df_products')
check_nulls(df_sellers, 'df_sellers')


----------------------------------------------------------------------------------------------------
DF_ORDERS 

order_id 0
customer_id 0
order_status 0
order_purchase_timestamp 0
order_approved_at 160
order_delivered_carrier_date 1783
order_delivered_customer_date 2965
order_estimated_delivery_date 0
----------------------------------------------------------------------------------------------------
DF_CUSTOMERS 

customer_id 0
customer_unique_id 0
customer_zip_code_prefix 0
customer_city 0
customer_state 0
----------------------------------------------------------------------------------------------------
DF_GEOLOCATION 

geolocation_zip_code_prefix 0
geolocation_lat 0
geolocation_lng 0
geolocation_city 0
geolocation_state 0
----------------------------------------------------------------------------------------------------
DF_ORDER_ITEMS 

order_id 0
order_item_id 0
product_id 0
seller_id 0
shipping_limit_date 0
price 0
freight_value 0
-----------------------------------------------

Null values were found in 3 DataFrames: df_orders, df_order_reviews, and df_products. However, the data represent sales operations, which go through several stages: a sale may be completed, canceled, processing, or in transit. Depending on the stage, some columns can therefore legitimately be empty (null).

### Creating temporary views for use with Spark SQL

In [89]:
df_orders.createOrReplaceTempView('orders')
df_customers.createOrReplaceTempView('customers')
df_geolocation.createOrReplaceTempView('geolocation')
df_order_items.createOrReplaceTempView('order_items')
df_order_payments.createOrReplaceTempView('order_payments')
df_order_reviews.createOrReplaceTempView('order_reviews')
df_products.createOrReplaceTempView('products')
df_sellers.createOrReplaceTempView('sellers')
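
Once the views are registered, they can be queried with `spark.sql`. As an illustrative sketch, the query below breaks the null counts of the `orders` view down by `order_status`, which is one way to check the earlier observation that missing dates depend on the stage of the sale. The names `NULLS_BY_STATUS_SQL` and `nulls_by_status` are hypothetical; the column and view names come from the notebook.

```python
# Null counts for the three date columns of `orders` that contain nulls,
# grouped by order status.
NULLS_BY_STATUS_SQL = """
SELECT
    order_status,
    COUNT(*) AS total_orders,
    SUM(CASE WHEN order_approved_at IS NULL THEN 1 ELSE 0 END) AS null_approved_at,
    SUM(CASE WHEN order_delivered_carrier_date IS NULL THEN 1 ELSE 0 END) AS null_carrier_date,
    SUM(CASE WHEN order_delivered_customer_date IS NULL THEN 1 ELSE 0 END) AS null_customer_date
FROM orders
GROUP BY order_status
ORDER BY total_orders DESC
"""


def nulls_by_status(spark):
    """Run the query against the `orders` temporary view."""
    return spark.sql(NULLS_BY_STATUS_SQL)
```

Calling `nulls_by_status(spark).show()` requires the same session in which the `orders` view was created, since temporary views are scoped to the session.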
