## Module: Analytics Engineering

## Class 2 - Part 1

### Class 2 agenda:

> ### 1. **Data quality principles and their benefits**;
> ### 2. **Data quality in practice**;
> ### 3. **Exercise development**.

#### Link to the class feedback form:
https://forms.gle/pYhxH5MMAZYPCwvVA

### **Data quality principles**

- **Accuracy:** Accurate data is correct and free of errors. It comes from reliable sources and is updated regularly. Accuracy can be compromised by manual entry errors, integration of data from multiple sources, omitted data, and duplicated data. To maintain accuracy, check the data regularly to identify and correct errors.

- **Consistency:** Consistent data uses the same format and keeps the same definitions across all systems and processes. Consistency can be compromised by a lack of data standardization, by varying field names, and by a lack of referential integrity. To maintain consistency, establish standards for data entry and verify regularly that they are being followed.

- **Reliability:** Reliable data is accurate and consistent, and it is also accessible and secure. It is protected against loss, theft, and corruption, and authorized users can access it. Reliability can be compromised by weak data security, missing regular backups, and missing access controls. To maintain reliability, implement data security practices such as regular backups and access controls.

- **Relevance:** Relevant data is data needed for business decisions. It is collected according to business requirements and is useful to end users. Relevance can be compromised by collecting unnecessary data or by failing to collect important data. To maintain relevance, establish a process to identify the business's data requirements and collect only the data that is needed.
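
The principles above can be turned into simple, measurable checks. The sketch below is only an illustration with pandas: the helper name `quality_summary`, the toy columns, and the choice of indicators are assumptions made for this example and are not part of the class material.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame, required_columns: list) -> dict:
    """Return a few simple data quality indicators (hypothetical helper)."""
    return {
        # Accuracy proxy: share of rows that repeat an earlier row exactly
        "duplicate_row_pct": float(df.duplicated().mean() * 100),
        # Accuracy proxy: share of missing cells across the whole frame
        "missing_cell_pct": float(df.isna().mean().mean() * 100),
        # Relevance proxy: required columns that were not collected
        "missing_columns": [c for c in required_columns if c not in df.columns],
    }


toy = pd.DataFrame(
    {
        "brand": ["audi", "audi", "jeep", None],
        "price": [18300, 18300, 9800, 480],
    }
)
summary = quality_summary(toy, required_columns=["brand", "price", "model"])
print(summary)
```

Indicators like these are only proxies. Which checks matter depends on the business rules behind each dataset.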


### **Benefits of data quality**

- **Better-informed decision making:** Accurate, consistent, reliable, and relevant data lets companies base their decisions on information they can trust.

- **Cost reduction:** Low-quality data can lead to poor decisions, which in turn generate additional costs. By improving data quality, companies can cut the unnecessary costs associated with those decisions.

- **Improved efficiency:** Accurate and consistent data helps make business processes more efficient by reducing the time spent on error correction and rework.

- **Higher customer satisfaction:** Accurate and relevant data lets companies better understand their customers' needs and offer tailored solutions. This can increase customer satisfaction and brand loyalty.



### **Data quality in practice**

### Installing the library used to profile the data

In [1]:
pip install ydata-profiling

Note: you may need to restart the kernel to use updated packages.




### Importing the libraries

In [2]:
import pandas as pd
from ydata_profiling import ProfileReport

### Dataset on used car prices:
https://data.world/data-society/used-cars-data

In [4]:
# Raw string so the backslashes in the Windows path are not read as escape sequences
df = pd.read_csv(r"F:\ADA\AnalyticsEngineer\lucio\dados\autos.csv", encoding='ISO-8859-1')
df.head(3)

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",privat,Angebot,9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46


In [5]:
!pip install ipython-sql
!pip install sqlalchemy
!pip install psycopg2



In [3]:
from sqlalchemy import create_engine, text as sql_text

In [4]:
engine = create_engine('postgresql://postgres:ada@localhost/ada')

In [9]:
df.to_sql('base_autos_raw', engine, if_exists='replace', index=False)

528
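
One way to confirm that a load is complete is to compare the number of rows in the table with the length of the DataFrame. The sketch below uses an in-memory SQLite database as a stand-in for the PostgreSQL engine used in this class, and a two-row toy table. Both are assumptions made so that the example runs without a database server.

```python
import sqlite3

import pandas as pd

toy = pd.DataFrame({"name": ["Golf_3_1.6", "Peugeot_206"], "price": [480, 1300]})

# In-memory SQLite stands in for the PostgreSQL engine used in the class
conn = sqlite3.connect(":memory:")
toy.to_sql("base_autos_raw", conn, if_exists="replace", index=False)

# Compare the number of rows in the table with the DataFrame length
loaded_rows = pd.read_sql("SELECT COUNT(*) AS n FROM base_autos_raw", conn)["n"].iloc[0]
print(loaded_rows == len(toy))
conn.close()
```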

### The "describe" method returns summary statistics for the DataFrame

In [11]:
df.describe()

Unnamed: 0,price,yearOfRegistration,powerPS,kilometer,monthOfRegistration,nrOfPictures,postalCode
count,371528.0,371528.0,371528.0,371528.0,371528.0,371528.0,371528.0
mean,17295.14,2004.577997,115.549477,125618.688228,5.734445,0.0,50820.66764
std,3587954.0,92.866598,192.139578,40112.337051,3.712412,0.0,25799.08247
min,0.0,1000.0,0.0,5000.0,0.0,0.0,1067.0
25%,1150.0,1999.0,70.0,125000.0,3.0,0.0,30459.0
50%,2950.0,2003.0,105.0,150000.0,6.0,0.0,49610.0
75%,7200.0,2008.0,150.0,150000.0,9.0,0.0,71546.0
max,2147484000.0,9999.0,20000.0,150000.0,12.0,0.0,99998.0
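
The summary above already hints at accuracy problems: `yearOfRegistration` ranges from 1000 to 9999, `powerPS` reaches 20000, and the maximum `price` is about 2.1 billion. A minimal sketch for counting such rows follows. The helper `count_out_of_range`, the bounds, and the sample values are assumptions for illustration and should be replaced by the actual business rules.

```python
import pandas as pd

# Hypothetical plausibility bounds (inclusive); adjust them to the business rules
bounds = {
    "yearOfRegistration": (1900, 2016),
    "monthOfRegistration": (1, 12),
}


def count_out_of_range(df: pd.DataFrame, limits: dict) -> pd.Series:
    """Count, per column, the rows whose value falls outside [low, high]."""
    counts = {
        # between() returns False for missing values, so they are counted too
        col: int((~df[col].between(low, high)).sum())
        for col, (low, high) in limits.items()
    }
    return pd.Series(counts, name="out_of_range")


sample = pd.DataFrame(
    {
        "yearOfRegistration": [1993, 2011, 1000, 9999],
        "monthOfRegistration": [0, 5, 8, 12],
    }
)
out_of_range = count_out_of_range(sample, bounds)
print(out_of_range)
```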


### Data profile report generated with the "ydata_profiling" library

In [12]:
profile = ProfileReport(df, title="Pandas Profiling Report")
profile.to_file(r"F:\ADA\AnalyticsEngineer\lucio\dados\resultados-1.html")


### Type of each column in the DataFrame

In [16]:
df.dtypes

dateCrawled            object
name                   object
seller                 object
offerType              object
price                   int64
abtest                 object
vehicleType            object
yearOfRegistration      int64
gearbox                object
powerPS                 int64
model                  object
kilometer               int64
monthOfRegistration     int64
fuelType               object
brand                  object
notRepairedDamage      object
dateCreated            object
nrOfPictures            int64
postalCode              int64
lastSeen               object
dtype: object

### Create a copy of the DataFrame to start cleaning the data. The first step is to set the type of the non-text columns.

In [14]:
df_cln = df.copy()

In [15]:
list_datetime = ['dateCrawled', 'dateCreated', 'lastSeen']  # date and time columns
for column in list_datetime:
    # convert the column to "datetime" using the given format
    df_cln[column] = pd.to_datetime(df_cln[column], format='%Y-%m-%d %H:%M:%S')
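
With an explicit `format`, `pd.to_datetime` raises an error on the first value that does not match. When the goal is to inspect bad values rather than stop, one option is `errors="coerce"`. The sketch below uses a made-up three-value series as an assumption for illustration.

```python
import pandas as pd

raw = pd.Series(["2016-03-24 11:52:17", "2016-03-14 12:52:21", "not a date"])

# errors="coerce" turns values that do not match the format into NaT
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Rows that failed to parse can be inspected before deciding how to treat them
invalid = raw[parsed.isna()]
print(invalid.tolist())
```

Values that were already missing in the source also end up as `NaT`, so they appear in `invalid` together with the malformed ones.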

In [18]:
list_int = ['yearOfRegistration', 'monthOfRegistration', 'nrOfPictures', 'postalCode']  # integer columns
for column in list_int:
    # convert the column to integer
    df_cln[column] = df_cln[column].astype("int")

In [20]:
list_float = ['price', 'powerPS', 'kilometer']  # float columns
for column in list_float:
    # convert the column to float
    df_cln[column] = df_cln[column].astype("float")

### Check the new type of each column

In [22]:
df_cln.dtypes

dateCrawled            datetime64[ns]
name                           object
seller                         object
offerType                      object
price                         float64
abtest                         object
vehicleType                    object
yearOfRegistration              int32
gearbox                        object
powerPS                       float64
model                          object
kilometer                     float64
monthOfRegistration             int32
fuelType                       object
brand                          object
notRepairedDamage              object
dateCreated            datetime64[ns]
nrOfPictures                    int32
postalCode                      int32
lastSeen               datetime64[ns]
dtype: object

### Function to check for missing values

In [26]:
def check_missing(df_param):
    """Return the percentage of missing values in each column."""
    res_missing = df_param.isna().sum()
    res_missing = (res_missing/len(df_param))*100
    return res_missing

In [27]:
check_missing(df_cln).sort_values(ascending=False)

notRepairedDamage      19.395577
vehicleType            10.192771
fuelType                8.986133
model                   5.513447
gearbox                 5.439429
kilometer               0.000000
postalCode              0.000000
nrOfPictures            0.000000
dateCreated             0.000000
brand                   0.000000
monthOfRegistration     0.000000
dateCrawled             0.000000
name                    0.000000
powerPS                 0.000000
yearOfRegistration      0.000000
abtest                  0.000000
price                   0.000000
offerType               0.000000
seller                  0.000000
lastSeen                0.000000
dtype: float64
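
When many columns have no missing values, it can help to report counts next to percentages and to hide the complete columns. The sketch below is one possible variant. The helper name `missing_report` and the toy data are assumptions for illustration.

```python
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column (hypothetical helper)."""
    report = pd.DataFrame(
        {
            "missing_count": df.isna().sum(),
            "missing_pct": df.isna().mean() * 100,
        }
    )
    # Keep only the columns that actually have missing values, worst first
    return report[report["missing_count"] > 0].sort_values("missing_pct", ascending=False)


toy = pd.DataFrame(
    {
        "gearbox": ["manuell", None, "automatik", None],
        "model": ["golf", "polo", None, "corsa"],
        "price": [480, 18300, 9800, 1500],
    }
)
report = missing_report(toy)
print(report)
```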

### Filling missing values with a fixed value

In [28]:
df_cln['notRepairedDamage'].value_counts()

notRepairedDamage
nein    263182
ja       36286
Name: count, dtype: int64

In [29]:
df_cln['notRepairedDamage'] = df_cln['notRepairedDamage'].fillna("no_info")

In [30]:
df_cln['notRepairedDamage'].value_counts()

notRepairedDamage
nein       263182
no_info     72060
ja          36286
Name: count, dtype: int64

### Filling missing values with a fixed value

In [31]:
df_cln['vehicleType'].value_counts()

vehicleType
limousine     95894
kleinwagen    80023
kombi         67564
bus           30201
cabrio        22898
coupe         19015
suv           14707
andere         3357
Name: count, dtype: int64

In [32]:
df_cln['vehicleType'] = df_cln['vehicleType'].fillna("no_info")

In [33]:
df_cln['vehicleType'].value_counts()

vehicleType
limousine     95894
kleinwagen    80023
kombi         67564
no_info       37869
bus           30201
cabrio        22898
coupe         19015
suv           14707
andere         3357
Name: count, dtype: int64

### Filling missing values with the most frequent value

In [34]:
df_cln['fuelType'].value_counts()

fuelType
benzin     223857
diesel     107746
lpg          5378
cng           571
hybrid        278
andere        208
elektro       104
Name: count, dtype: int64

In [36]:
high_freq = df_cln['fuelType'].value_counts().idxmax()
df_cln['fuelType'] = df_cln['fuelType'].fillna(high_freq)

In [37]:
df_cln['fuelType'].value_counts()

fuelType
benzin     257243
diesel     107746
lpg          5378
cng           571
hybrid        278
andere        208
elektro       104
Name: count, dtype: int64
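
Filling with the most frequent value changes the distribution: in the counts above, `benzin` grows from 223857 to 257243 rows. If the imputed rows need to stay identifiable, one option is to record a flag before filling. The sketch below shows the idea on toy data. The flag column name `fuelType_imputed` is an assumption for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"fuelType": ["benzin", "diesel", None, "benzin", None]})

# Flag the rows that will be imputed, so the original gaps stay traceable
toy["fuelType_imputed"] = toy["fuelType"].isna()

# mode() ignores missing values; [0] picks the first value in case of ties
most_frequent = toy["fuelType"].mode()[0]
toy["fuelType"] = toy["fuelType"].fillna(most_frequent)
print(toy)
```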

### Filling missing values with a fixed value

In [38]:
df_cln['model'].value_counts()

model
golf               30070
andere             26400
3er                20567
polo               13092
corsa              12573
                   ...  
serie_2                8
rangerover             6
serie_3                4
serie_1                2
discovery_sport        1
Name: count, Length: 251, dtype: int64

In [39]:
df_cln['model'] = df_cln['model'].fillna("no_info")

In [40]:
df_cln['model'].value_counts()

model
golf               30070
andere             26400
3er                20567
no_info            20484
polo               13092
                   ...  
serie_2                8
rangerover             6
serie_3                4
serie_1                2
discovery_sport        1
Name: count, Length: 252, dtype: int64

### Filling missing values with a fixed value

In [41]:

df_cln['gearbox'].value_counts()

gearbox
manuell      274214
automatik     77105
Name: count, dtype: int64

In [42]:
df_cln['gearbox'] = df_cln['gearbox'].fillna("no_info")

In [43]:
df_cln['gearbox'].value_counts()

gearbox
manuell      274214
automatik     77105
no_info       20209
Name: count, dtype: int64

### Check the missing values after the treatment

In [44]:
check_missing(df_cln).sort_values(ascending=False)

dateCrawled            0.0
name                   0.0
postalCode             0.0
nrOfPictures           0.0
dateCreated            0.0
notRepairedDamage      0.0
brand                  0.0
fuelType               0.0
monthOfRegistration    0.0
kilometer              0.0
model                  0.0
powerPS                0.0
gearbox                0.0
yearOfRegistration     0.0
vehicleType            0.0
abtest                 0.0
price                  0.0
offerType              0.0
seller                 0.0
lastSeen               0.0
dtype: float64

### Removing duplicate rows

In [45]:
df_cln

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480.0,test,no_info,1993,manuell,0.0,golf,150000.0,0,benzin,volkswagen,no_info,2016-03-24,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300.0,test,coupe,2011,manuell,190.0,no_info,125000.0,5,diesel,audi,ja,2016-03-24,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",privat,Angebot,9800.0,test,suv,2004,automatik,163.0,grand,125000.0,8,diesel,jeep,no_info,2016-03-14,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TÜRER,privat,Angebot,1500.0,test,kleinwagen,2001,manuell,75.0,golf,150000.0,6,benzin,volkswagen,nein,2016-03-17,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,privat,Angebot,3600.0,test,kleinwagen,2008,manuell,69.0,fabia,90000.0,7,diesel,skoda,nein,2016-03-31,0,60437,2016-04-06 10:17:21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
371523,2016-03-14 17:48:27,Suche_t4___vito_ab_6_sitze,privat,Angebot,2200.0,test,no_info,2005,no_info,0.0,no_info,20000.0,1,benzin,sonstige_autos,no_info,2016-03-14,0,39576,2016-04-06 00:46:52
371524,2016-03-05 19:56:21,Smart_smart_leistungssteigerung_100ps,privat,Angebot,1199.0,test,cabrio,2000,automatik,101.0,fortwo,125000.0,3,benzin,smart,nein,2016-03-05,0,26135,2016-03-11 18:17:12
371525,2016-03-19 18:57:12,Volkswagen_Multivan_T4_TDI_7DC_UY2,privat,Angebot,9200.0,test,bus,1996,manuell,102.0,transporter,150000.0,3,diesel,volkswagen,nein,2016-03-19,0,87439,2016-04-07 07:15:26
371526,2016-03-20 19:41:08,VW_Golf_Kombi_1_9l_TDI,privat,Angebot,3400.0,test,kombi,2002,manuell,100.0,golf,150000.0,6,diesel,volkswagen,no_info,2016-03-20,0,40764,2016-03-24 12:45:21


In [46]:
print("Before: ", len(df_cln))
df_cln = df_cln.drop_duplicates()
print("After: ", len(df_cln))

Before:  371528
After:  371524
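
Before dropping duplicates, it can be useful to look at the rows involved. The sketch below uses `duplicated(keep=False)` to show every member of a duplicated group. The toy rows are an assumption for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "name": ["Golf_3_1.6", "Golf_3_1.6", "Peugeot_206"],
        "price": [480, 480, 1300],
    }
)

# keep=False marks every member of a duplicated group, not only the repeats
duplicated_rows = toy[toy.duplicated(keep=False)]
print(duplicated_rows)

# drop_duplicates() keeps the first occurrence of each group by default
deduplicated = toy.drop_duplicates()
print("Before:", len(toy), "After:", len(deduplicated))
```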


### Removing constant columns

In [47]:
list_constant = [col for col in df_cln.columns if df_cln[col].nunique() ==1]
list_constant

['nrOfPictures']

In [49]:
print("Before: ", len(df_cln.columns))
df_cln = df_cln.drop(list_constant, axis=1)
print("After: ", len(df_cln.columns))

Before:  20
After:  19
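
One detail to keep in mind: `nunique()` ignores missing values by default, so a column with a single value plus gaps also counts as constant. In this notebook the missing values were already filled, so the result is the same. The sketch below shows the difference on toy data, where the `flag` column is an assumption for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "nrOfPictures": [0, 0, 0],
        "flag": ["x", None, "x"],
        "price": [480, 18300, 9800],
    }
)

# Default nunique() ignores missing values, so "flag" looks constant
constant_default = [c for c in toy.columns if toy[c].nunique() == 1]

# dropna=False counts NaN as its own value, so only truly constant columns remain
constant_strict = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]
print(constant_default, constant_strict)
```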


### Removing extremely imbalanced columns

In [50]:
df_cln['offerType'].value_counts(normalize=True)

offerType
Angebot    0.999968
Gesuch     0.000032
Name: proportion, dtype: float64

In [54]:
df_cln['seller'].value_counts(normalize=True)

seller
privat        0.999992
gewerblich    0.000008
Name: proportion, dtype: float64

In [53]:
df_cln['seller'].value_counts(normalize=True).values[0]

0.999991925151538

In [52]:
list_1 = []
limit = 0.98
for col in df_cln.columns:
    perc = df_cln[col].value_counts(normalize=True).values[0]
    if perc > limit:
        list_1.append(col)
        print(col, perc)

seller 0.999991925151538
offerType 0.9999677006061519
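
The loop above can be wrapped in a reusable function. The sketch below is one way to do it. The name `dominant_columns` and the toy data are assumptions for illustration, and the guard for empty counts is an addition that avoids an `IndexError` on columns containing only missing values.

```python
import pandas as pd


def dominant_columns(df: pd.DataFrame, limit: float = 0.98) -> dict:
    """Map each column whose most frequent value exceeds `limit` to that share."""
    result = {}
    for col in df.columns:
        shares = df[col].value_counts(normalize=True)
        # value_counts sorts by frequency, so the first entry is the dominant value
        if not shares.empty and shares.iloc[0] > limit:
            result[col] = float(shares.iloc[0])
    return result


toy = pd.DataFrame(
    {
        "seller": ["privat"] * 99 + ["gewerblich"],
        "abtest": ["test", "control"] * 50,
    }
)
dominant = dominant_columns(toy, limit=0.98)
print(dominant)
```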


In [55]:
df_cln = df_cln.drop(list_1, axis=1)

In [56]:
df_cln

Unnamed: 0,dateCrawled,name,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,postalCode,lastSeen
0,2016-03-24 11:52:17,Golf_3_1.6,480.0,test,no_info,1993,manuell,0.0,golf,150000.0,0,benzin,volkswagen,no_info,2016-03-24,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,18300.0,test,coupe,2011,manuell,190.0,no_info,125000.0,5,diesel,audi,ja,2016-03-24,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",9800.0,test,suv,2004,automatik,163.0,grand,125000.0,8,diesel,jeep,no_info,2016-03-14,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TÜRER,1500.0,test,kleinwagen,2001,manuell,75.0,golf,150000.0,6,benzin,volkswagen,nein,2016-03-17,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,3600.0,test,kleinwagen,2008,manuell,69.0,fabia,90000.0,7,diesel,skoda,nein,2016-03-31,60437,2016-04-06 10:17:21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
371523,2016-03-14 17:48:27,Suche_t4___vito_ab_6_sitze,2200.0,test,no_info,2005,no_info,0.0,no_info,20000.0,1,benzin,sonstige_autos,no_info,2016-03-14,39576,2016-04-06 00:46:52
371524,2016-03-05 19:56:21,Smart_smart_leistungssteigerung_100ps,1199.0,test,cabrio,2000,automatik,101.0,fortwo,125000.0,3,benzin,smart,nein,2016-03-05,26135,2016-03-11 18:17:12
371525,2016-03-19 18:57:12,Volkswagen_Multivan_T4_TDI_7DC_UY2,9200.0,test,bus,1996,manuell,102.0,transporter,150000.0,3,diesel,volkswagen,nein,2016-03-19,87439,2016-04-07 07:15:26
371526,2016-03-20 19:41:08,VW_Golf_Kombi_1_9l_TDI,3400.0,test,kombi,2002,manuell,100.0,golf,150000.0,6,diesel,volkswagen,no_info,2016-03-20,40764,2016-03-24 12:45:21


### Checking the result after removing the extremely imbalanced columns

In [58]:
list_1 = []
limit = 0.98
for col in df_cln.columns:
    perc = df_cln[col].value_counts(normalize=True).values[0]
    if perc > limit:
        list_1.append(col)
        print(col, perc)

### Checking the accuracy of the data

### Months must be in the range: 1 <= month <= 12

In [59]:
df_cln[(df_cln['monthOfRegistration']<1) | (df_cln['monthOfRegistration']>12)]

Unnamed: 0,dateCrawled,name,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,postalCode,lastSeen
0,2016-03-24 11:52:17,Golf_3_1.6,480.0,test,no_info,1993,manuell,0.0,golf,150000.0,0,benzin,volkswagen,no_info,2016-03-24,70435,2016-04-07 03:16:57
9,2016-03-17 10:53:50,VW_Golf_4_5_tuerig_zu_verkaufen_mit_Anhaengerk...,999.0,test,kleinwagen,1998,manuell,101.0,golf,150000.0,0,benzin,volkswagen,no_info,2016-03-17,27472,2016-03-31 17:17:06
15,2016-03-11 21:39:15,KA_Lufthansa_Edition_450_VB,450.0,test,kleinwagen,1910,no_info,0.0,ka,5000.0,0,benzin,ford,no_info,2016-03-11,24148,2016-03-19 08:46:47
16,2016-04-01 12:46:46,Polo_6n_1_4,300.0,test,no_info,2016,no_info,60.0,polo,150000.0,0,benzin,volkswagen,no_info,2016-04-01,38871,2016-04-01 12:46:46
36,2016-03-11 11:50:37,Opel_Kadett_E_CC,1600.0,control,andere,1991,manuell,75.0,kadett,70000.0,0,benzin,opel,no_info,2016-03-11,2943,2016-04-07 03:46:09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
371460,2016-04-03 13:46:24,Polo_g40_auch_Tausch_vag...no_vr6_gti_1.8t,3500.0,control,no_info,1995,no_info,0.0,polo,150000.0,0,benzin,volkswagen,no_info,2016-04-03,74579,2016-04-05 12:44:38
371473,2016-03-15 19:57:11,Subaru_Allrad,400.0,control,kombi,1991,manuell,0.0,legacy,150000.0,0,benzin,subaru,no_info,2016-03-15,24558,2016-03-19 15:49:00
371482,2016-03-31 19:36:18,Peugeot_206,1300.0,control,kleinwagen,1999,manuell,75.0,2_reihe,125000.0,0,benzin,peugeot,no_info,2016-03-31,35102,2016-04-06 13:44:44
371486,2016-03-30 20:55:30,Zu_verkaufen,350.0,control,kleinwagen,1996,no_info,65.0,punto,150000.0,0,benzin,fiat,no_info,2016-03-30,25436,2016-04-07 13:50:41


In [60]:
df_cln.loc[(df_cln['monthOfRegistration']<1) | (df_cln['monthOfRegistration']>12), 'monthOfRegistration'] = -1

In [61]:
df_cln[(df_cln['monthOfRegistration']<1) | (df_cln['monthOfRegistration']>12)]

Unnamed: 0,dateCrawled,name,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,postalCode,lastSeen
0,2016-03-24 11:52:17,Golf_3_1.6,480.0,test,no_info,1993,manuell,0.0,golf,150000.0,-1,benzin,volkswagen,no_info,2016-03-24,70435,2016-04-07 03:16:57
9,2016-03-17 10:53:50,VW_Golf_4_5_tuerig_zu_verkaufen_mit_Anhaengerk...,999.0,test,kleinwagen,1998,manuell,101.0,golf,150000.0,-1,benzin,volkswagen,no_info,2016-03-17,27472,2016-03-31 17:17:06
15,2016-03-11 21:39:15,KA_Lufthansa_Edition_450_VB,450.0,test,kleinwagen,1910,no_info,0.0,ka,5000.0,-1,benzin,ford,no_info,2016-03-11,24148,2016-03-19 08:46:47
16,2016-04-01 12:46:46,Polo_6n_1_4,300.0,test,no_info,2016,no_info,60.0,polo,150000.0,-1,benzin,volkswagen,no_info,2016-04-01,38871,2016-04-01 12:46:46
36,2016-03-11 11:50:37,Opel_Kadett_E_CC,1600.0,control,andere,1991,manuell,75.0,kadett,70000.0,-1,benzin,opel,no_info,2016-03-11,2943,2016-04-07 03:46:09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
371460,2016-04-03 13:46:24,Polo_g40_auch_Tausch_vag...no_vr6_gti_1.8t,3500.0,control,no_info,1995,no_info,0.0,polo,150000.0,-1,benzin,volkswagen,no_info,2016-04-03,74579,2016-04-05 12:44:38
371473,2016-03-15 19:57:11,Subaru_Allrad,400.0,control,kombi,1991,manuell,0.0,legacy,150000.0,-1,benzin,subaru,no_info,2016-03-15,24558,2016-03-19 15:49:00
371482,2016-03-31 19:36:18,Peugeot_206,1300.0,control,kleinwagen,1999,manuell,75.0,2_reihe,125000.0,-1,benzin,peugeot,no_info,2016-03-31,35102,2016-04-06 13:44:44
371486,2016-03-30 20:55:30,Zu_verkaufen,350.0,control,kleinwagen,1996,no_info,65.0,punto,150000.0,-1,benzin,fiat,no_info,2016-03-30,25436,2016-04-07 13:50:41


### Checking the result

In [62]:
df_cln[((df_cln['monthOfRegistration']<1) | (df_cln['monthOfRegistration']>12)) & (df_cln['monthOfRegistration']!=-1)]

Unnamed: 0,dateCrawled,name,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,postalCode,lastSeen
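
The sentinel `-1` keeps the column as an integer, but it also takes part in aggregations such as the mean. An alternative, shown here only as a sketch on made-up values, is to use a nullable integer type and mark invalid months as missing.

```python
import pandas as pd

months = pd.Series([0, 5, 8, 13, 12], name="monthOfRegistration")

# between() is inclusive on both ends by default: True for 1 <= month <= 12
valid = months.between(1, 12)

# Nullable integer dtype lets invalid months become <NA> instead of a sentinel such as -1
cleaned = months.astype("Int64").where(valid)
print(cleaned)
```

Which option is better depends on the consumers of the table: a sentinel is easy to store and filter, while a missing value is skipped automatically by most pandas aggregations.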


### The year must be in the range: 1900 <= year <= 2016

In [64]:
df_cln.loc[(df_cln['yearOfRegistration']<1900) | (df_cln['yearOfRegistration']>2016), 'yearOfRegistration'] = 1900

### Checking the result

In [65]:
df_cln[(df_cln['yearOfRegistration']<1900) | (df_cln['yearOfRegistration']>2016)]

Unnamed: 0,dateCrawled,name,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,postalCode,lastSeen
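
Because 1900 lies inside the valid interval, corrected rows can no longer be told apart from cars genuinely registered in 1900. If that distinction matters, one option is to record a flag before overwriting. The sketch below uses toy values, and the column name `year_was_invalid` is an assumption for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"yearOfRegistration": [1993, 1000, 2011, 9999, 1900]})

out_of_range = ~toy["yearOfRegistration"].between(1900, 2016)

# Record which rows were corrected before overwriting them with the sentinel,
# because 1900 is itself inside the valid interval
toy["year_was_invalid"] = out_of_range
toy.loc[out_of_range, "yearOfRegistration"] = 1900
print(toy)
```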


### Prices must be greater than 0

In [68]:
df_cln[(df_cln['price']<=0) ]

Unnamed: 0,dateCrawled,name,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,postalCode,lastSeen
7,2016-03-21 18:54:38,VW_Derby_Bj_80__Scheunenfund,0.0,test,limousine,1980,manuell,50.0,andere,40000.0,7,benzin,volkswagen,nein,2016-03-21,19348,2016-03-25 16:47:58
40,2016-03-26 22:06:17,Suche_Opel_corsa_a_zu_verschenken,0.0,test,no_info,1990,no_info,0.0,corsa,150000.0,1,benzin,opel,no_info,2016-03-26,56412,2016-03-27 17:43:34
115,2016-03-19 18:40:12,Golf_IV_1.4_16V,0.0,test,no_info,1900,manuell,0.0,golf,5000.0,12,benzin,volkswagen,no_info,2016-03-19,21698,2016-04-01 08:47:05
119,2016-03-20 18:53:27,Polo_6n_Karosse_zu_verschenken,0.0,test,kleinwagen,1999,no_info,0.0,no_info,5000.0,-1,benzin,volkswagen,no_info,2016-03-20,37520,2016-04-07 02:45:22
157,2016-03-11 18:55:53,Opel_meriva_1.6_16_v_lpg__z16xe_no_OPC,0.0,test,bus,2004,manuell,101.0,meriva,150000.0,10,lpg,opel,ja,2016-03-11,27432,2016-03-12 23:47:10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
371356,2016-03-09 15:56:30,Verkaufen_einen_Opel_corsa_b_worlcup_cool,0.0,control,no_info,2000,manuell,65.0,corsa,150000.0,-1,benzin,opel,ja,2016-03-09,23758,2016-03-30 11:16:08
371392,2016-03-20 14:55:07,Ford_Fiesta_1.3___60PS___Bj_2002___Klima___Servo,0.0,test,kleinwagen,2002,manuell,60.0,fiesta,150000.0,3,benzin,ford,no_info,2016-03-20,33659,2016-04-06 18:45:23
371402,2016-03-24 13:48:05,Suzuki_Swift_zu_verkaufen,0.0,control,kleinwagen,1999,manuell,53.0,swift,150000.0,3,benzin,suzuki,no_info,2016-03-24,42329,2016-04-07 05:17:24
371431,2016-03-10 22:55:50,Seat_Arosa,0.0,control,kleinwagen,1999,manuell,37.0,arosa,150000.0,7,benzin,seat,ja,2016-03-10,22559,2016-03-12 23:46:32


In [69]:
df_cln = df_cln[df_cln['price']>0]

In [6]:
df_cln[(df_cln['price']<=0) ]

Unnamed: 0,dateCrawled,name,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,postalCode,lastSeen


In [71]:
df_cln.to_sql('base_autos_raw_2', engine, if_exists='replace', index=False)

746

In [7]:
query = """
SELECT *
FROM base_autos_raw_2
"""
df_cln = pd.read_sql(sql=sql_text(query), con=engine.connect())
df_cln

Unnamed: 0,dateCrawled,name,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,postalCode,lastSeen
0,2016-03-24 11:52:17,Golf_3_1.6,480.0,test,no_info,1993,manuell,0.0,golf,150000.0,-1,benzin,volkswagen,no_info,2016-03-24,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,18300.0,test,coupe,2011,manuell,190.0,no_info,125000.0,5,diesel,audi,ja,2016-03-24,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",9800.0,test,suv,2004,automatik,163.0,grand,125000.0,8,diesel,jeep,no_info,2016-03-14,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TÜRER,1500.0,test,kleinwagen,2001,manuell,75.0,golf,150000.0,6,benzin,volkswagen,nein,2016-03-17,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,3600.0,test,kleinwagen,2008,manuell,69.0,fabia,90000.0,7,diesel,skoda,nein,2016-03-31,60437,2016-04-06 10:17:21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
360741,2016-03-14 17:48:27,Suche_t4___vito_ab_6_sitze,2200.0,test,no_info,2005,no_info,0.0,no_info,20000.0,1,benzin,sonstige_autos,no_info,2016-03-14,39576,2016-04-06 00:46:52
360742,2016-03-05 19:56:21,Smart_smart_leistungssteigerung_100ps,1199.0,test,cabrio,2000,automatik,101.0,fortwo,125000.0,3,benzin,smart,nein,2016-03-05,26135,2016-03-11 18:17:12
360743,2016-03-19 18:57:12,Volkswagen_Multivan_T4_TDI_7DC_UY2,9200.0,test,bus,1996,manuell,102.0,transporter,150000.0,3,diesel,volkswagen,nein,2016-03-19,87439,2016-04-07 07:15:26
360744,2016-03-20 19:41:08,VW_Golf_Kombi_1_9l_TDI,3400.0,test,kombi,2002,manuell,100.0,golf,150000.0,6,diesel,volkswagen,no_info,2016-03-20,40764,2016-03-24 12:45:21


In [8]:
df_cln.describe()

Unnamed: 0,dateCrawled,price,yearOfRegistration,powerPS,kilometer,monthOfRegistration,dateCreated,postalCode,lastSeen
count,360746,360746.0,360746.0,360746.0,360746.0,360746.0,360746,360746.0,360746
mean,2016-03-21 13:28:33.416300544,17812.03,1998.905718,116.575923,125661.823,5.703282,2016-03-20 19:38:18.377251584,50996.062246,2016-03-30 04:41:29.232967168
min,2016-03-05 14:06:22,1.0,1900.0,0.0,5000.0,-1.0,2014-03-10 00:00:00,1067.0,2016-03-05 14:15:08
25%,2016-03-13 12:50:46.249999872,1250.0,1998.0,72.0,100000.0,3.0,2016-03-13 00:00:00,30823.0,2016-03-23 12:48:59.249999872
50%,2016-03-21 18:06:19,3000.0,2003.0,105.0,150000.0,6.0,2016-03-21 00:00:00,49751.0,2016-04-04 02:44:25
75%,2016-03-29 14:50:34,7490.0,2007.0,150.0,150000.0,9.0,2016-03-29 00:00:00,71672.0,2016-04-06 10:45:56
max,2016-04-07 14:36:58,2147484000.0,2016.0,20000.0,150000.0,12.0,2016-04-07 00:00:00,99998.0,2016-04-07 14:58:51
std,,3641176.0,21.095961,190.609039,39836.433575,3.837602,,25760.472206,


### Removing outliers

#### Example of a distribution and its quantiles
<img src="https://media.geeksforgeeks.org/wp-content/uploads/20201127112813/NORMALDISTRIBUTION-660x362.png"  width="80%" height="60%">

In [9]:
df_cln[['price', 'powerPS']].quantile(.02)

price      200.0
powerPS      0.0
Name: 0.02, dtype: float64

In [10]:
df_cln[['price', 'powerPS']].quantile(.98)

price      28500.0
powerPS      300.0
Name: 0.98, dtype: float64

In [12]:
print("Number of rows before removing outliers:", len(df_cln))
list_quantile = ['price','powerPS']
df_aux = df_cln.copy()  # unfiltered copy, so every limit is computed on the same data
for col in list_quantile:
    low_limit = df_aux[col].quantile(.02)
    high_limit = df_aux[col].quantile(.98)
    df_cln = df_cln[(df_cln[col]>low_limit) & (df_cln[col]<high_limit)]
print("Number of rows after removing outliers:", len(df_cln))

Number of rows before removing outliers: 360746
Number of rows after removing outliers: 306747
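
The same filter can be written as a function that builds a single boolean mask. The sketch below follows the logic of the loop above: limits are computed on the unfiltered data and the comparisons are strict. The function name `filter_by_quantiles` and the toy prices are assumptions for illustration.

```python
import pandas as pd


def filter_by_quantiles(df, columns, low=0.02, high=0.98):
    """Keep rows strictly between the low and high quantiles of every column.

    The limits are computed on the unfiltered frame.
    """
    mask = pd.Series(True, index=df.index)
    for col in columns:
        low_limit = df[col].quantile(low)
        high_limit = df[col].quantile(high)
        mask &= (df[col] > low_limit) & (df[col] < high_limit)
    return df[mask]


# 98 ordinary prices plus one extreme value on each side
toy = pd.DataFrame({"price": [0] + list(range(100, 198)) + [1_000_000]})
filtered = filter_by_quantiles(toy, ["price"])
print(len(toy), len(filtered), filtered["price"].min(), filtered["price"].max())
```

Because the comparisons are strict, rows equal to a limit are removed as well. In the notebook this is why every row with `powerPS` equal to 0 disappears: the 2% quantile of that column is 0.0.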


In [13]:
df_cln.describe()

Unnamed: 0,dateCrawled,price,yearOfRegistration,powerPS,kilometer,monthOfRegistration,dateCreated,postalCode,lastSeen
count,306747,306747.0,306747.0,306747.0,306747.0,306747.0,306747,306747.0,306747
mean,2016-03-21 13:48:27.800284928,5340.119343,1999.688421,120.376809,126738.224009,5.936378,2016-03-20 19:53:57.919197440,51358.191099,2016-03-30 07:08:33.236644352
min,2016-03-05 14:06:23,202.0,1900.0,1.0,5000.0,-1.0,2015-03-20 00:00:00,1067.0,2016-03-05 14:15:08
25%,2016-03-13 12:43:02,1440.0,1999.0,80.0,125000.0,3.0,2016-03-13 00:00:00,31084.0,2016-03-23 16:49:53.500000
50%,2016-03-21 18:43:48,3300.0,2003.0,115.0,150000.0,6.0,2016-03-21 00:00:00,50354.0,2016-04-04 08:44:28
75%,2016-03-29 15:06:20,7450.0,2007.0,150.0,150000.0,9.0,2016-03-29 00:00:00,72160.0,2016-04-06 11:16:01
max,2016-04-07 14:36:58,28499.0,2016.0,299.0,150000.0,12.0,2016-04-07 00:00:00,99998.0,2016-04-07 14:58:51
std,,5444.798096,19.581832,49.774169,38105.625551,3.691866,,25740.20057,


### Generate the new results

In [15]:
profile = ProfileReport(df_cln, title="Pandas Profiling Report - Final Version")
profile.to_file(r"F:\ADA\AnalyticsEngineer\lucio\dados\resultados-2.html")

In [16]:
df_cln.to_sql('base_autos_silver', engine, if_exists='replace', index=False)

747

In [17]:
df_cln.to_sql('base_autos_gold', engine, if_exists='replace', index=False)

747