# Final Project - Olist

## Description

"Brazilian e-commerce public dataset by Olist
The dataset has information on 100 thousand orders placed from 2016 to 2018 at multiple marketplaces in Brazil.

Olist connects small businesses from all over Brazil to sales channels without hassle and with a single contract. These merchants can sell their products through the Olist Store and ship them directly to customers using Olist's logistics partners.

After a customer buys a product from the Olist Store, a seller is notified to fulfill that order. Once the customer receives the product, or the estimated delivery date passes, the customer receives a satisfaction survey by email where they can rate the purchase experience and write some comments."


## Importing Libraries

In [1]:
import glob
import os
import re

import matplotlib.pyplot as plt  # needed by the plotting cells further down
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder

# Data description

The columns of each CSV file are described below:

* The `olist_customers_dataset.csv` contains the following features:

Feature | Description 
----------|---------------
**customer_id** | Id of the consumer who made the purchase.
**customer_unique_id** | Unique Id of the consumer.
**customer_zip_code_prefix** | Zip code of the consumer's location.
**customer_city** | Name of the city from which the order was made.
**customer_state** | State code from which the order was made (e.g. SP for São Paulo).

* The `olist_sellers_dataset.csv` contains the following features:

Feature | Description 
----------|---------------
**seller_id** | Unique Id of the seller registered in Olist.
**seller_zip_code_prefix** | Zip code of the seller's location.
**seller_city** | Name of the seller's city.
**seller_state** | State code of the seller (e.g. SP for São Paulo).


* The `olist_order_items_dataset.csv` contains the following features:

Feature | Description 
----------|---------------
**order_id** | A unique id of the order made by the consumer.
**order_item_id** | A sequential id given to each item within the order.
**product_id** | A unique id given to each product available on the site.
**seller_id** | Unique Id of the seller registered in Olist.
**shipping_limit_date** | The date before which shipping of the ordered product must be completed.
**price** | Actual price of the product ordered.
**freight_value** | Freight charged for delivering the item from one point to another.

* The `olist_order_payments_dataset.csv` contains the following features:

Feature | Description 
----------|---------------
**order_id** | A unique id of the order made by the consumer.
**payment_sequential** | Sequence of the payments made in case of EMI.
**payment_type** | Mode of payment used (e.g. credit card).
**payment_installments** | Number of installments in case of an EMI purchase.
**payment_value** | Total amount paid for the purchase order.



* The `olist_orders_dataset.csv` contains the following features:

Feature | Description 
----------|---------------
**order_id** | A unique id of the order made by the consumer.
**customer_id** | Id of the consumer who made the purchase.
**order_status** | Status of the order, e.g. delivered, shipped, etc.
**order_purchase_timestamp** | Timestamp of the purchase.
**order_approved_at** | Timestamp of the order approval.
**order_delivered_carrier_date** | Date on which the order was handed over to the carrier.
**order_delivered_customer_date** | Date on which the customer received the product.
**order_estimated_delivery_date** | Estimated delivery date of the order.


* The `olist_order_reviews_dataset.csv` contains the following features:

Feature | Description 
----------|---------------
**review_id** | Id of the review given for the order identified by the order id.
**order_id** | A unique id of the order made by the consumer.
**review_score** | Review score given by the customer for each order, on a scale of 1 to 5.
**review_comment_title** | Title of the review.
**review_comment_message** | Review comments posted by the consumer for each order.
**review_creation_date** | Timestamp at which the review was created.
**review_answer_timestamp** | Timestamp at which the review was answered.


* The `olist_products_dataset.csv` contains the following features:

Feature | Description 
----------|---------------
**product_id** | A unique identifier for the product.
**product_category_name** | Name of the product category.
**product_name_lenght** | Length of the string holding the product name.
**product_description_lenght** | Length of the description written for the product on the site.
**product_photos_qty** | Number of photos of the product available on the shopping portal.
**product_weight_g** | Weight of the product in grams.
**product_length_cm** | Length of the product in centimeters.
**product_height_cm** | Height of the product in centimeters.
**product_width_cm** | Width of the product in centimeters.


# Reading the files

In [8]:
itens = pd.read_csv("Data/itens.csv") # items
ordens = pd.read_csv("Data/ordens.csv") # order
produtos = pd.read_csv("Data/produtos.csv") # products
geolocal = pd.read_csv("Data/geolocal.csv") # geolocation
avaliacoes = pd.read_csv('Data/avaliacoes.csv') # reviews
clientes = pd.read_csv("Data/clientes.csv") # customers
pagamentos = pd.read_csv("Data/pagamentos.csv") # payments
vendedores = pd.read_csv("Data/vendedores.csv") # seller

In [5]:
%pip install pymysql==1.0.2

Collecting pymysql==1.0.2
  Using cached PyMySQL-1.0.2-py3-none-any.whl (43 kB)
Installing collected packages: pymysql
Successfully installed pymysql-1.0.2
Note: you may need to restart the kernel to use updated packages.


In [6]:
import sqlalchemy as db

db_server='pymysql'
user='root'
db_port = '3306'
password = 'andressa13'
ip = 'localhost'
db_name = 'olist'
engine = db.create_engine(f'mysql+{db_server}://{user}:{password}@{ip}:{db_port}/{db_name}?charset=utf8')
conn = engine.connect()


In [9]:
itens.to_sql('itens',con=conn, index=False, if_exists= 'replace')

112650

In [10]:
ordens.to_sql('ordens',con=conn, index=False, if_exists= 'replace')
produtos.to_sql('produtos',con=conn, index=False, if_exists= 'replace')
geolocal.to_sql('geolocal',con=conn, index=False, if_exists= 'replace')
avaliacoes.to_sql('avaliacoes',con=conn, index=False, if_exists= 'replace')
clientes.to_sql('clientes',con=conn, index=False, if_exists= 'replace')
pagamentos.to_sql('pagamentos',con=conn, index=False, if_exists= 'replace')
vendedores.to_sql('vendedores',con=conn, index=False, if_exists= 'replace')

3095
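The per-table `to_sql` calls above can also be written as a loop over a dictionary of DataFrames. The following is only a minimal sketch: it assumes an in-memory SQLite database from the standard library in place of the MySQL server, and the two small tables are hypothetical stand-ins for the Olist CSVs.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-ins for the DataFrames loaded from the CSV files.
tables = {
    "itens": pd.DataFrame({"order_id": ["a", "a", "b"], "price": [10.0, 5.5, 7.0]}),
    "vendedores": pd.DataFrame({"seller_id": ["s1", "s2"], "seller_state": ["SP", "RJ"]}),
}

# Assumption: an in-memory SQLite connection, used only so the sketch runs without a server.
conn = sqlite3.connect(":memory:")

row_counts = {}
for name, frame in tables.items():
    # Same arguments as in the notebook: no index column, replace an existing table.
    frame.to_sql(name, con=conn, index=False, if_exists="replace")
    # Read the row count back to confirm that the load worked.
    count = pd.read_sql(f"SELECT COUNT(*) AS n FROM {name}", conn)["n"].iloc[0]
    row_counts[name] = int(count)

print(row_counts)
conn.close()
```

Reading the counts back gives a quick check that each table holds as many rows as its source DataFrame.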

# Checking the data - Updated

## Itens - items

* Table 'itens' has 112650 rows and 7 columns (order_id, order_item_id, product_id, seller_id, shipping_limit_date, price, freight_value)

In [None]:
itens.shape

In [None]:
itens.info()

In [None]:
itens.isnull().sum()

In [None]:
itens.columns

## Ordens - orders

Table 'ordens' has 99441 rows and 8 columns ('order_id', 'customer_id', 'order_status', 'order_purchase_timestamp',
'order_approved_at', 'order_delivered_carrier_date',
'order_delivered_customer_date', 'order_estimated_delivery_date')


Nulls: order_approved_at - 160, order_delivered_carrier_date - 1783, order_delivered_customer_date - 2965

In [None]:
ordens.shape

In [None]:
ordens.info()

In [None]:
ordens.isnull().sum()

In [None]:
ordens.columns

## Produtos - products

Table with 32340 rows and 6 columns

In [None]:
produtos.shape

In [None]:
produtos.info()

In [None]:
produtos.isnull().sum()

In [None]:
produtos.columns

In [None]:
produtos['product_category_name'].value_counts().head(10)

## Avaliações - reviews

Table 'avaliacoes' has 99224 rows and 7 columns (review_id, order_id, review_score,
review_comment_title, review_comment_message, review_creation_date, review_answer_timestamp)

In [None]:
avaliacoes.shape

In [None]:
avaliacoes.info()

In [None]:
avaliacoes.isnull().sum()

In [None]:
avaliacoes.columns

## Vendedores - sellers

Table 'vendedores' has 3095 rows and 4 columns ('seller_id', 'seller_zip_code_prefix', 'seller_city', 'seller_state')

In [None]:
vendedores.shape

In [None]:
vendedores.info()

In [None]:
vendedores.isnull().sum()

In [None]:
vendedores.columns

## Pagamentos - payments

Table 'pagamentos' has 103886 rows and 5 columns ('order_id', 'payment_sequential', 'payment_type', 'payment_installments', 'payment_value')

In [None]:
pagamentos.shape

In [None]:
pagamentos.info()

In [None]:
pagamentos.isnull().sum()

In [None]:
pagamentos.columns

In [None]:
pagamentos['payment_type'].value_counts()

## Clientes - customers

Table 'clientes' has 99441 rows and 5 columns ('customer_id', 'customer_unique_id', 'customer_zip_code_prefix', 'customer_city', 'customer_state')

In [None]:
clientes.shape

In [None]:
clientes.info()

In [None]:
clientes.isnull().sum()

In [None]:
clientes.columns

### City

In [None]:
clientes.customer_city.unique()

### State

In [None]:
clientes.customer_state.unique()

In [None]:
clientes.groupby('customer_city').count()['customer_id'].reset_index()

## Geolocation

In [None]:
geolocal.shape

In [None]:
geolocal.info()

In [None]:
geolocal.isnull().sum()

In [None]:
geolocal.columns

In [None]:
geolocal['geolocation_city'].value_counts()

# Analyzing missing data - original DataFrame

In [None]:
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing.head()
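The cell above refers to a combined DataFrame `df`, which is not built in the cells shown here. A minimal sketch of the same missing-value summary on a small hypothetical frame (`df_demo` and its values are invented for illustration), using `isnull().mean()` to get the missing fraction directly:

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the column names are borrowed from the Olist tables.
df_demo = pd.DataFrame({
    "review_comment_title": [np.nan, np.nan, np.nan, "ok"],
    "order_approved_at": ["2017-01-01", np.nan, "2017-01-03", "2017-01-04"],
    "price": [10.0, 20.0, 30.0, 40.0],
})

total = df_demo.isnull().sum()      # number of missing values per column
percent = df_demo.isnull().mean()   # fraction of missing values per column

missing = (
    pd.concat([total, percent], axis=1, keys=["Total", "Percent"])
    .sort_values("Total", ascending=False)
)
print(missing)
```

Sorting once after the concatenation keeps the two columns aligned on the same index order.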

## Converting to datetime (ordens)

In [None]:
times_cols = ['order_purchase_timestamp', 'order_approved_at', 'order_delivered_carrier_date', 'order_estimated_delivery_date', 'order_delivered_customer_date']
for col in times_cols:
    ordens[col] = pd.to_datetime(ordens[col])
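If some timestamp strings might be malformed, `errors="coerce"` turns them into `NaT` instead of raising. This is a sketch on hypothetical rows, not the notebook's data; whether the Olist files contain such values is not established here.

```python
import pandas as pd

# Hypothetical rows; the second order has no approval timestamp.
orders_demo = pd.DataFrame({
    "order_purchase_timestamp": ["2017-10-02 10:56:33", "2018-07-24 20:41:37"],
    "order_approved_at": ["2017-10-02 11:07:15", None],
})

time_cols = ["order_purchase_timestamp", "order_approved_at"]
for col in time_cols:
    # errors="coerce" maps unparseable or missing values to NaT.
    orders_demo[col] = pd.to_datetime(orders_demo[col], errors="coerce")

print(orders_demo.dtypes)
```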


## Dropping columns

Option: drop columns in which more than 97% of the values are missing

In [None]:
df.drop(['seller_state', 'seller_city', 'seller_zip_code_prefix', 'review_comment_title', 'product_photos_qty'],axis=1, inplace=True)
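Instead of listing the columns by hand, the 97% rule can be derived from the data. A minimal sketch on a hypothetical frame (`df_demo`, `threshold` and the column names are illustrative assumptions, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with different amounts of missing data per column.
df_demo = pd.DataFrame({
    "mostly_missing": [np.nan] * 99 + [1.0],   # 99% missing
    "half_missing": [np.nan, 1.0] * 50,        # 50% missing
    "complete": range(100),                    # 0% missing
})

threshold = 0.97  # drop columns whose missing fraction exceeds this value

missing_fraction = df_demo.isnull().mean()
cols_to_drop = missing_fraction[missing_fraction > threshold].index.tolist()

# drop() returns a new frame, so the original stays available for comparison.
df_clean = df_demo.drop(columns=cols_to_drop)
print(cols_to_drop)
```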

Dropping the columns 'review_comment_title' and 'review_comment_message'

In [None]:
avaliacoes.drop(['review_comment_title', 'review_comment_message'],axis=1, inplace=True)

After this cleanup the DataFrame has 99224 rows and 5 columns (review_id, order_id, review_score, review_creation_date, review_answer_timestamp).

## Dropping duplicates

In [None]:
geolocal.drop_duplicates(inplace=True)

## Dropping rows with nulls

In [None]:
produtos.dropna(inplace=True)

In [None]:
produtos.drop(['product_name_lenght', 'product_description_lenght', 'product_photos_qty'],axis=1, inplace=True)

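## Handling null rows (ordens)

Filling null values with the closest available information

In [None]:
# Assign the result back: fillna(inplace=True) on a selected column may not update ordens in recent pandas versions.
ordens["order_approved_at"] = ordens["order_approved_at"].fillna(ordens["order_purchase_timestamp"])
ordens["order_delivered_customer_date"] = ordens["order_delivered_customer_date"].fillna(ordens["order_estimated_delivery_date"])
# Runs last so it can fall back on the customer date filled in the previous line.
ordens["order_delivered_carrier_date"] = ordens["order_delivered_carrier_date"].fillna(ordens["order_delivered_customer_date"])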

In [None]:
ordens.info()
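The fallback order matters, because the carrier date is filled from a customer date that may itself have been filled from the estimate. A minimal sketch on two hypothetical orders (all values invented) that makes the chain explicit:

```python
import pandas as pd

# Hypothetical orders: the first was never delivered, the second lacks an approval time.
orders_demo = pd.DataFrame({
    "order_purchase_timestamp": pd.to_datetime(["2018-01-01", "2018-01-05"]),
    "order_approved_at": pd.to_datetime(["2018-01-01 01:00", None]),
    "order_delivered_carrier_date": pd.to_datetime([None, "2018-01-06"]),
    "order_delivered_customer_date": pd.to_datetime([None, "2018-01-09"]),
    "order_estimated_delivery_date": pd.to_datetime(["2018-01-15", "2018-01-20"]),
})

# (column to fill, column to fall back on), applied in this order.
fallbacks = [
    ("order_approved_at", "order_purchase_timestamp"),
    ("order_delivered_customer_date", "order_estimated_delivery_date"),
    ("order_delivered_carrier_date", "order_delivered_customer_date"),
]
for target, source in fallbacks:
    orders_demo[target] = orders_demo[target].fillna(orders_demo[source])

print(orders_demo.isnull().sum())
```

After this fill, an order that was never delivered looks as if it arrived exactly on its estimated date, which is worth keeping in mind in any later delivery-time analysis.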

## Extracting purchase-date attributes - year and month

Below we convert the purchase timestamp of each order into several formats.

In [None]:
# The .dt accessor works on the whole datetime column at once and returns missing values for NaT.
purchase_ts = ordens['order_purchase_timestamp']
ordens['order_purchase_year'] = purchase_ts.dt.year
ordens['order_purchase_month'] = purchase_ts.dt.month
ordens['order_purchase_month_name'] = purchase_ts.dt.strftime('%b')
ordens['order_purchase_year_month'] = purchase_ts.dt.strftime('%Y%m')
ordens['order_purchase_date'] = purchase_ts.dt.strftime('%Y%m%d')

In [None]:
ordens.head(2)

## Charts

In [None]:
plt.figure(figsize=(10,6))
sns.set_style("whitegrid")
# value_counts() already sorts in descending order, so head(10) gives the top 10 states.
ax = clientes.customer_state.value_counts().head(10).plot(kind='bar', color='grey', alpha=0.8)
ax.set_title("Top 10 - Consumer states in Brazil")
ax.set_xlabel("States")
plt.xticks(rotation=35)
ax.set_ylabel("Number of consumers")
plt.show()

## Analyzing growth by period

In [None]:
total_orders_month_year = ordens.groupby('order_purchase_year_month')['order_id'].nunique().reset_index()
plt.figure(figsize = (18,7))
sns.barplot(data = total_orders_month_year,
             x = 'order_purchase_year_month',
             y = 'order_id')
plt.suptitle("Total orders by year_month")

The number of orders began to rise sharply from November 2017; next we investigate the reason for such a sudden increase.
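One way to put a number on such a jump is the month-over-month percentage change. The counts below are made up for illustration and are not taken from the Olist data; only the column names follow the notebook.

```python
import pandas as pd

# Invented monthly order counts, shaped like total_orders_month_year.
monthly_orders = pd.DataFrame({
    "order_purchase_year_month": ["201709", "201710", "201711", "201712"],
    "order_id": [100, 110, 165, 132],
})

# Percentage change relative to the previous month (the first month has no reference).
monthly_orders["pct_change"] = monthly_orders["order_id"].pct_change() * 100

# Month with the largest relative increase.
largest_jump = monthly_orders.loc[
    monthly_orders["pct_change"].idxmax(), "order_purchase_year_month"
]
print(monthly_orders)
print(largest_jump)
```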

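### Checking categories by period

In [None]:
# product_category_name is a produtos column, so join ordens -> itens -> produtos first.
ordens[ordens['order_purchase_year'] == 2017].merge(itens, on='order_id').merge(produtos, on='product_id')['product_category_name'].nunique()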
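Category names sit in the products table, so counting categories per purchase year needs a join through the items table. A minimal sketch with hypothetical rows (the three `*_demo` frames are invented stand-ins for `ordens`, `itens` and `produtos`):

```python
import pandas as pd

orders_demo = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "order_purchase_year": [2017, 2017, 2018],
})
items_demo = pd.DataFrame({
    "order_id": ["o1", "o1", "o2", "o3"],
    "product_id": ["p1", "p2", "p1", "p3"],
})
products_demo = pd.DataFrame({
    "product_id": ["p1", "p2", "p3"],
    "product_category_name": ["beleza_saude", "informatica_acessorios", "brinquedos"],
})

# orders -> items on order_id, then items -> products on product_id.
merged = (
    orders_demo
    .merge(items_demo, on="order_id", how="inner")
    .merge(products_demo, on="product_id", how="inner")
)

# Number of distinct categories sold in each purchase year.
categories_per_year = merged.groupby("order_purchase_year")["product_category_name"].nunique()
print(categories_per_year)
```

Grouping by year gives the count for every period in one pass instead of filtering one year at a time.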

In [None]:
ordens

In [None]:
order_delivered_customer_date_y


In [None]:
order_estimated_delivery_date_y


## Pipeline

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier
import sklearn
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()
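The notebook does not yet say which target the pipeline will predict. As a sketch only, assuming a generic binary classification task on synthetic data from `make_classification`, the scaler and a gradient boosting model could be chained like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the Olist features and target (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The scaler is fitted on the training split only; the pipeline reuses it at predict time.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", GradientBoostingClassifier(random_state=0)),
])
pipeline.fit(X_train, y_train)

accuracy = pipeline.score(X_test, y_test)
predictions = pipeline.predict(X_test)
print(f"Accuracy on the held-out split: {accuracy:.2f}")
```

Wrapping both steps in a `Pipeline` keeps the scaling statistics from leaking from the test split into training.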

## Location

In [None]:
from opencage.geocoder import OpenCageGeocode
