# **Data Analysis**

## **Descriptive Data Analysis**

## **0. Imports**

In [43]:
import pandas as pd
import numpy as np
import sqlite3
from sqlalchemy import create_engine

### **0.1. Data Collection** 

In [23]:
path = '/home/eron/repos/SalesPricePredict/'
database_name = 'database_hm.sqlite'
conn = create_engine('sqlite:///' + path + database_name, echo = False)

In [24]:
query = """
    SELECT * FROM vitrine
"""

In [25]:
df_raw = pd.read_sql(query, con = conn)
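The load step above depends on a local database file, so it cannot be rerun elsewhere as written. The sketch below reproduces the same `SELECT * FROM vitrine` pattern against an in-memory SQLite database, using the standard-library `sqlite3` connection that `pd.read_sql` also accepts. The two-column table and its rows are made-up placeholders, not the real `vitrine` schema, and `conn_demo` / `df_demo` are hypothetical names.

```python
import sqlite3

import pandas as pd

# Toy in-memory database; the schema and rows are invented for illustration.
conn_demo = sqlite3.connect(':memory:')
conn_demo.execute('CREATE TABLE vitrine (product_id TEXT, product_price REAL)')
conn_demo.executemany(
    'INSERT INTO vitrine VALUES (?, ?)',
    [('p1', 19.99), ('p2', 29.99)],
)
conn_demo.commit()

# Same query pattern as the notebook, run against the toy table.
df_demo = pd.read_sql('SELECT * FROM vitrine', con=conn_demo)
conn_demo.close()

print(df_demo.shape)
```
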

## **1. Data Description**

In [28]:
df01 = df_raw.copy()

### **1.1. Data Dimension**

In [31]:
print('Number of rows: {}'.format(df01.shape[0]))
print('Number of columns: {}'.format(df01.shape[1]))

Number of rows: 916
Number of columns: 14


### **1.2. Data Types**

In [32]:
df01.dtypes

product_id          object
style_id            object
color_id            object
product_name        object
color_name          object
fit                 object
product_price      float64
size_number         object
size_model          object
cotton             float64
polyester          float64
elastane           float64
elasterell         float64
scrapy_datetime     object
dtype: object

In [35]:
# Convert object to datetime
df01['scrapy_datetime'] = pd.to_datetime(df01['scrapy_datetime'])

### **1.3. Missing Data**

In [37]:
df01.isna().sum()

product_id           0
style_id             0
color_id             0
product_name         0
color_name           0
fit                  0
product_price        0
size_number        721
size_model         685
cotton               0
polyester            0
elastane             0
elasterell           0
scrapy_datetime      0
dtype: int64

In [38]:
df01.isna().sum() / df01.shape[0]

product_id         0.000000
style_id           0.000000
color_id           0.000000
product_name       0.000000
color_name         0.000000
fit                0.000000
product_price      0.000000
size_number        0.787118
size_model         0.747817
cotton             0.000000
polyester          0.000000
elastane           0.000000
elasterell         0.000000
scrapy_datetime    0.000000
dtype: float64

### **1.4. Solving Missing Data**

After reviewing the ETL code and checking the company's website, I found that **size number** and **size model** are not reported in the source. They are missing in about 79% and 75% of the rows, respectively, so I decided to **drop both features**.

In [39]:
df01 = df01.drop(columns = ['size_number', 'size_model'])
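The decision above was made by hand for two columns. As a possible generalization, the sketch below drops every column whose share of missing values exceeds a threshold. The helper name `drop_sparse_columns`, the 0.5 default threshold, and the toy data are all assumptions for illustration; the notebook itself drops the two columns explicitly.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing_ratio=0.5):
    """Return (copy of df without sparse columns, list of dropped names).

    Hypothetical helper: a column is dropped when its fraction of NaN
    values is strictly greater than max_missing_ratio.
    """
    missing_ratio = df.isna().mean()  # fraction of NaN per column
    to_drop = missing_ratio[missing_ratio > max_missing_ratio].index.tolist()
    return df.drop(columns=to_drop), to_drop


# Toy frame with made-up values; both size columns are 75% missing.
df_toy = pd.DataFrame({
    'product_price': [19.99, 29.99, 39.99, 24.99],
    'size_number': [np.nan, np.nan, np.nan, '32'],
    'size_model': [np.nan, 'M', np.nan, np.nan],
})

df_clean, dropped = drop_sparse_columns(df_toy, max_missing_ratio=0.5)
print(dropped)
```

A threshold keeps the rule explicit and repeatable if the scraper later adds columns, but the cutoff is a judgment call and should be checked against the source, as was done here.
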

### **1.5. Data Description**

In [42]:
num_attributes = df01.select_dtypes(include = ['float64', 'int64'])
cat_attributes = df01.select_dtypes(exclude = ['float64', 'int64', 'datetime64[ns]'])

#### **1.5.1. Numerical Data**

In [48]:
# Central tendency - mean and median
t1 = pd.DataFrame(num_attributes.apply(np.mean)).T
t2 = pd.DataFrame(num_attributes.apply(np.median)).T

# Dispersion - std, min, max, range, skew, kurtosis
d1 = pd.DataFrame(num_attributes.apply(np.std)).T
d2 = pd.DataFrame(num_attributes.apply(np.min)).T
d3 = pd.DataFrame(num_attributes.apply(np.max)).T
d4 = pd.DataFrame(num_attributes.apply(lambda x: x.max() - x.min())).T
d5 = pd.DataFrame(num_attributes.apply(lambda x: x.skew())).T
d6 = pd.DataFrame(num_attributes.apply(lambda x: x.kurtosis())).T

# Concat
m1 = pd.concat([d2, d3, d4, t1, t2, d1, d5, d6]).T.reset_index()
m1.columns = ['attributes', 'min', 'max', 'range', 'mean', 'median', 'std', 'skew', 'kurtosis']
m1

|   | attributes | min | max | range | mean | median | std | skew | kurtosis |
|---|---|---|---|---|---|---|---|---|---|
| 0 | product_price | 5.99 | 49.99 | 44.0 | 26.313144 | 24.99 | 9.909875 | 0.568862 | -0.216318 |
| 1 | cotton | 0.66 | 1.0 | 0.34 | 0.962969 | 0.98 | 0.063127 | -2.4391 | 5.463768 |
| 2 | polyester | 0.0 | 1.0 | 1.0 | 0.175404 | 0.0 | 0.284468 | 1.384655 | 0.422636 |
| 3 | elastane | 0.0 | 0.02 | 0.02 | 0.005502 | 0.0 | 0.007935 | 0.983963 | -0.702336 |
| 4 | elasterell | 0.0 | 0.09 | 0.09 | 0.005524 | 0.0 | 0.020332 | 3.416728 | 9.705184 |
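The same summary can probably be built more compactly with `DataFrame.agg`. The sketch below is an alternative, not the notebook's method: `describe_numeric` is a hypothetical helper and the toy data is made up. Note that the `'std'` aggregation in pandas is the sample standard deviation (`ddof=1`), so its values may differ slightly from the `np.std` call in the cell above, depending on which degrees-of-freedom default that call ends up using.

```python
import pandas as pd


def describe_numeric(df):
    """Hypothetical helper: one row per numeric column, one column per statistic."""
    num = df.select_dtypes(include='number')
    stats = num.agg(['min', 'max', 'mean', 'median', 'std', 'skew', 'kurt']).T
    # Insert the range right after 'max' to mirror the column order used above.
    stats.insert(2, 'range', stats['max'] - stats['min'])
    stats = stats.rename(columns={'kurt': 'kurtosis'})
    return stats.rename_axis('attributes').reset_index()


# Toy data with made-up values; the text column is ignored by the helper.
df_toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 10.0],
    'b': [0.0, 0.0, 0.5, 1.0, 1.0],
    'name': ['v', 'w', 'x', 'y', 'z'],
})

summary = describe_numeric(df_toy)
print(summary)
```
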


#### **1.5.2. Categorical Data**

## **2. Feature Engineering**

## **3. Data Filtering**

## **4. EDA - Exploratory Data Analysis**