<a href="https://colab.research.google.com/github/andriellymoraespereira/analise-de-visitas-em-ecommerce/blob/main/01_data_cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import files
files.upload()


Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"andrielly","key":"[REDACTED]"}'}

In [2]:
import os

# Create the .kaggle folder
!mkdir -p ~/.kaggle

# Move kaggle.json into the .kaggle folder
!mv kaggle.json ~/.kaggle/

# Restrict the file permissions so that only the owner can read it
!chmod 600 ~/.kaggle/kaggle.json
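The same three steps can be written in plain Python, which may help when the notebook runs outside Colab. This is a minimal sketch, assuming the token file sits in the current working directory; the helper name `install_kaggle_credentials` and its `home` parameter are hypothetical and not part of the Kaggle tooling.

```python
import os
import shutil
from pathlib import Path


def install_kaggle_credentials(src="kaggle.json", home=None):
    """Move a Kaggle API token into <home>/.kaggle and restrict its permissions.

    Hypothetical helper for illustration; `home` defaults to the user's home.
    """
    home = Path(home) if home is not None else Path.home()
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)  # equivalent to `mkdir -p`
    target = target_dir / "kaggle.json"
    shutil.move(str(src), str(target))             # equivalent to `mv`
    os.chmod(target, 0o600)                        # equivalent to `chmod 600`
    return target
```

Calling `install_kaggle_credentials()` with no arguments mirrors the shell commands in the cell above.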


In [3]:
# Download the specific dataset from Kaggle
!kaggle datasets download -d mkechinov/ecommerce-behavior-data-from-multi-category-store

Dataset URL: https://www.kaggle.com/datasets/mkechinov/ecommerce-behavior-data-from-multi-category-store
License(s): copyright-authors
Downloading ecommerce-behavior-data-from-multi-category-store.zip to /content
100% 4.28G/4.29G [00:53<00:00, 119MB/s]
100% 4.29G/4.29G [00:53<00:00, 85.6MB/s]


In [4]:
import zipfile

# Unzip the downloaded archive
with zipfile.ZipFile('ecommerce-behavior-data-from-multi-category-store.zip', 'r') as zip_ref:
    zip_ref.extractall('ecommerce_data')

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load a subset of the data
file_path = 'ecommerce_data/2019-Nov.csv'
chunk_size = 1_000_000
chunks = []

for chunk in pd.read_csv(file_path, chunksize=chunk_size):
    chunks.append(chunk)
    if len(chunks) >= 5:  # Stop after 5 chunks (5,000,000 rows)
        break

df = pd.concat(chunks, axis=0)
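The chunked loading pattern can be wrapped in a small reusable function. The sketch below is one possible way to do it, using `itertools.islice` to cap the number of chunks; the function name `load_first_chunks` and the tiny in-memory CSV are made up for illustration and stand in for `2019-Nov.csv`.

```python
import io
from itertools import islice

import pandas as pd


def load_first_chunks(source, chunk_size, max_chunks):
    """Read at most `max_chunks` chunks of `chunk_size` rows each from a CSV."""
    reader = pd.read_csv(source, chunksize=chunk_size)
    # islice stops pulling chunks from the reader once max_chunks were read
    chunks = list(islice(reader, max_chunks))
    return pd.concat(chunks, axis=0, ignore_index=True)


# Tiny in-memory CSV with made-up values, standing in for the real file
sample_csv = io.StringIO("user_id,price\n1,10.0\n2,20.0\n3,30.0\n4,40.0\n5,50.0\n")
sample_df = load_first_chunks(sample_csv, chunk_size=2, max_chunks=2)
print(len(sample_df))  # 4 rows: two chunks of two rows each
```

When only the first rows are needed, `pd.read_csv(file_path, nrows=...)` is a shorter alternative; the chunked form is more useful if each chunk is filtered before being kept.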

In [6]:
# Check for missing values
print("Missing values before filling:\n", df.isnull().sum())

Missing values before filling:
 event_time             0
event_type             0
product_id             0
category_id            0
category_code    1602394
brand             727867
price                  0
user_id                0
user_session           0
dtype: int64


In [7]:
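# Fill missing values in the category_code and brand columns.
# Assigning the result back avoids the chained inplace call, which newer pandas versions warn about.
df['category_code'] = df['category_code'].fillna('Unknown')
df['brand'] = df['brand'].fillna('Unknown')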
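Both columns can also be filled in a single call by passing a column-to-value mapping to `DataFrame.fillna`. A minimal sketch on made-up rows (the `toy` frame is illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up rows that mimic the two columns with missing values
toy = pd.DataFrame({
    "category_code": ["electronics.smartphone", None, "appliances.kitchen"],
    "brand": [None, "samsung", None],
})

# Fill each column with its own placeholder in one call
toy = toy.fillna({"category_code": "Unknown", "brand": "Unknown"})

print(toy.isnull().sum().sum())  # 0 missing values remain
```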

In [8]:
# Check for and remove duplicates, keyed on the columns that identify the same event
print(f"Number of rows before removing duplicates: {df.shape[0]}")
df_cleaned = df.drop_duplicates(subset=['user_id', 'event_time', 'category_code', 'brand', 'price'])
print(f"Number of rows after removing duplicates: {df_cleaned.shape[0]}")

Number of rows before removing duplicates: 5000000
Number of rows after removing duplicates: 4997407
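Before dropping rows, it can be worth looking at what the chosen key actually treats as a duplicate. The sketch below uses made-up rows to show that, because `product_id` is not part of the key, two events for different products by the same user in the same second with the same brand and price are collapsed into one. Whether that is desired depends on the analysis; this is an illustration, not a statement about the real data.

```python
import pandas as pd

key_cols = ["user_id", "event_time", "category_code", "brand", "price"]

# Made-up events: rows 0 and 1 differ only in product_id
toy_events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_time": ["2019-11-01 00:00:00 UTC"] * 3,
    "category_code": ["Unknown", "Unknown", "Unknown"],
    "brand": ["acme", "acme", "acme"],
    "price": [9.99, 9.99, 9.99],
    "product_id": [100, 101, 100],
})

# keep=False marks every member of a duplicate group, so all copies can be inspected
dupes = toy_events[toy_events.duplicated(subset=key_cols, keep=False)]
deduped = toy_events.drop_duplicates(subset=key_cols)

print(len(dupes), len(deduped))
```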


In [9]:
# Check whether any missing values remain
print("Missing values after filling and duplicate removal:\n", df_cleaned.isnull().sum())

Missing values after filling and duplicate removal:
 event_time       0
event_type       0
product_id       0
category_id      0
category_code    0
brand            0
price            0
user_id          0
user_session     0
dtype: int64


In [10]:
# Fix data types where needed (e.g. convert timestamps to datetime).
# Work on an explicit copy so the assignments below do not raise SettingWithCopyWarning.
df_cleaned = df_cleaned.copy()
df_cleaned['event_time'] = pd.to_datetime(df_cleaned['event_time'])
if 'start_time' in df_cleaned.columns:
    df_cleaned['start_time'] = pd.to_datetime(df_cleaned['start_time'])
if 'end_time' in df_cleaned.columns:
    df_cleaned['end_time'] = pd.to_datetime(df_cleaned['end_time'])
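Once `event_time` is a datetime column, the `.dt` accessor and time arithmetic become available. A small sketch, assuming the raw timestamps look like `2019-11-01 00:00:00 UTC` (the sample values below are made up for illustration):

```python
import pandas as pd

# Made-up sample in the assumed raw format of event_time
raw_times = pd.Series(["2019-11-01 00:00:00 UTC", "2019-11-01 00:00:01 UTC"])

# utc=True yields a timezone-aware datetime column
parsed = pd.to_datetime(raw_times, utc=True)

# Datetime components and differences can now be computed directly
hours = parsed.dt.hour
elapsed = parsed.iloc[1] - parsed.iloc[0]
print(parsed.dtype, elapsed)
```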


In [11]:
# Keep only rows with a non-negative price, since negative prices are not valid
df_cleaned = df_cleaned[df_cleaned['price'] >= 0]
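It can help to count how many rows a filter removes before applying it, so that an unexpectedly large drop is noticed. A minimal sketch on made-up prices (the `toy_prices` frame is illustrative only):

```python
import pandas as pd

# Made-up prices, including one invalid negative value
toy_prices = pd.DataFrame({"price": [0.0, 12.5, -3.0, 99.9]})

# Build the mask once, report on it, then apply it
valid = toy_prices["price"] >= 0
n_invalid = int((~valid).sum())
print(f"Rows with a negative price: {n_invalid}")

toy_prices = toy_prices[valid]
```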

In [12]:
# Save the cleaned dataset
df_cleaned.to_csv('cleaned_data.csv', index=False)

print("Data cleaned and saved successfully.")

Data cleaned and saved successfully.


# Save the file to Google Drive

In [31]:
from google.colab import drive
drive.mount('/content/drive')

# Copy the CSV file to Google Drive
!cp cleaned_data.csv /content/drive/MyDrive/


Mounted at /content/drive


# Push the dataset to GitHub

The dataset exceeds GitHub's 100 MB file size limit, so we use Git Large File Storage (Git LFS) to handle the large file.
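A quick size check can confirm whether a file needs Git LFS before anything is committed. The sketch below is only an illustration: the helper name `needs_lfs` is hypothetical, and the exact byte count used for the 100 MB limit is an assumption.

```python
import os

# 100 MB limit as cited above; the exact byte count is an assumption
GITHUB_FILE_LIMIT_BYTES = 100 * 1024 * 1024


def needs_lfs(path, limit_bytes=GITHUB_FILE_LIMIT_BYTES):
    """Return True if the file at `path` is larger than the size limit."""
    return os.path.getsize(path) > limit_bytes
```

For example, `needs_lfs('cleaned_data.csv')` could be called right after the CSV is written.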

In [27]:
# Install Git LFS
!curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
!sudo apt-get install git-lfs
!git lfs install


Detected operating system as Ubuntu/jammy.
Checking for curl...
Detected curl...
Checking for gpg...
Detected gpg...
Detected apt version as 2.4.12
Running apt-get update... done.
Installing apt-transport-https... done.
Installing /etc/apt/sources.list.d/github_git-lfs.list...done.
Importing packagecloud gpg key... Packagecloud gpg key imported to /etc/apt/keyrings/github_git-lfs-archive-keyring.gpg
done.
Running apt-get update... done.

The repository is setup! You can now install packages.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.5.1).
0 upgraded, 0 newly installed, 0 to remove and 50 not upgraded.
Updated Git hooks.
Git LFS initialized.


In [28]:
# Clone the GitHub repository into Google Colab

from getpass import getpass

# Prompt for the GitHub personal access token (input is hidden)
token = getpass('Enter your GitHub token: ')

# Clone the repository using the token.
# Note: a token embedded in the URL is saved in the clone's .git/config.
!git clone https://{token}@github.com/andriellymoraespereira/analise-de-visitas-em-ecommerce.git

Enter your GitHub token: ··········
Cloning into 'analise-de-visitas-em-ecommerce'...
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 6 (delta 0), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (6/6), 6.73 KiB | 430.00 KiB/s, done.


In [29]:
# Configure Git LFS for the large file
import os

# Change into the cloned repository directory
os.chdir('analise-de-visitas-em-ecommerce')

# Tell Git LFS to track the CSV file
!git lfs track "cleaned_data.csv"


Tracking "cleaned_data.csv"


In [30]:
# Move the file, commit it, and push to GitHub
import shutil

# Move cleaned_data.csv into the cloned repository directory
shutil.move('../cleaned_data.csv', '.')

# Stage the file together with the LFS tracking rules
!git add .gitattributes cleaned_data.csv

# Commit the file
!git commit -m "Add cleaned data using Git LFS"

# Push to the GitHub repository
!git push origin main


[main 7debe65] Add cleaned data using Git LFS
 2 files changed, 4 insertions(+)
 create mode 100644 .gitattributes
 create mode 100644 cleaned_data.csv
Uploading LFS objects: 100% (1/1), 693 MB | 71 MB/s, done.
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 2 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 554 bytes | 554.00 KiB/s, done.
Total 4 (delta 0), reused 0 (delta 0), pack-reused 0
To https://github.com/andriellymoraespereira/analise-de-visitas-em-ecommerce.git
   0ee3729..7debe65  main -> main
