<a href="https://colab.research.google.com/github/Vinicius-L-R-Matos/teste_awari_aula_portifolio/blob/main/C%C3%B3pia_de_Data_Preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cleaning the Spotify Dataset

Today we will explore and clean a Spotify dataset available at:
[Dataset Spotify](https://www.kaggle.com/vitoriarodrigues/spotifycsv-file-modified-for-data-cleaning)

We have a dataset of songs from 2017 with attributes from the Spotify API. Each song carries a flag: "1" means the song was liked and "0" means it was disliked.

This dataset is provided for a future clustering exercise based on the songs' main attributes.

It has 13 attributes for each song: acousticness, danceability, durationms, energy, instrumentalness, key, liveness, loudness, mode, speechiness, tempo, timesignature, valence

Information about what these features mean can be found here: [API SPOTIFY](https://developer.spotify.com/web-api/get-audio-features/)

Endpoint: [Track endpoint](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-several-audio-features)


In [1]:
## Import the required libraries
import pandas as pd
import numpy as np

In [3]:
## Sort the data by song popularity
df = pd.read_csv('spotify_data_cleaning.csv', index_col=0)
df.sort_values('song_popularity', ascending=False, inplace=True) 

FileNotFoundError: ignored

## Inspecting the Dataset

In [None]:
## In total we have 18835 rows and 15 columns
print(df.shape)

(18835, 15)


In [None]:
df.head(5)

Unnamed: 0,song_name,song_popularity,song_duration_ms,acousticness,danceability,energy,instrumentalness,key,liveness,loudness,audio_mode,speechiness,tempo,time_signature,audio_valence
1757,Party In The U.S.A.,nao_sei,0.8220000000000001kg,0.519mol/L,0.36,0.0,10.0,0.177,-8.575,0.0,0.105,97.42,4.0,0.7,
7574,I Love It (& Lil Pump),99,127946,0.0114kg,0.901mol/L,0.522,0.0,2.0,0.259,-8.304,1.0,0.33,104.053,4.0,0.329
11777,I Love It (& Lil Pump),99,127946,0.0114kg,0.901mol/L,0.522,0.0,2.0,0.259,-8.304,1.0,0.33,104.053,4.0,0.329
4301,I Love It (& Lil Pump),99,127946,0.0114kg,0.901mol/L,0.522,0.0,2.0,0.259,-8.304,1.0,0.33,104.053,4.0,0.329
14444,I Love It (& Lil Pump),99,127946,0.0114kg,0.901mol/L,0.522,0.0,2.0,0.259,-8.304,1.0,0.33,104.053,4.0,0.329


In [None]:
## The .info() method is very handy: it returns the non-null count of each column, the total number of entries, and each column's dtype.
## Doc - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html
## Checking data types and non-null counts
## Initially almost nothing is null: only audio_valence has one missing value
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18835 entries, 1757 to 9956
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   song_name         18835 non-null  object 
 1   song_popularity   18835 non-null  object 
 2   song_duration_ms  18835 non-null  object 
 3   acousticness      18835 non-null  object 
 4   danceability      18835 non-null  object 
 5   energy            18835 non-null  object 
 6   instrumentalness  18835 non-null  object 
 7   key               18835 non-null  float64
 8   liveness          18835 non-null  object 
 9   loudness          18835 non-null  object 
 10  audio_mode        18835 non-null  object 
 11  speechiness       18835 non-null  object 
 12  tempo             18835 non-null  object 
 13  time_signature    18835 non-null  object 
 14  audio_valence     18834 non-null  float64
dtypes: float64(2), object(13)
memory usage: 2.3+ MB


In [None]:
## Returns summary statistics such as count, quartiles, mean, standard deviation, maximum and minimum.
## Doc - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html?highlight=describe#pandas.DataFrame.describe
## Only two columns appear here because the others are stored as object, so describe() cannot compute these aggregations for them.
df.describe()

Unnamed: 0,key,audio_valence
count,18835.0,18834.0
mean,5.288674,0.527958
std,3.614624,0.244635
min,0.0,0.0
25%,2.0,0.335
50%,5.0,0.5265
75%,8.0,0.725
max,11.0,0.984
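
To see why `describe()` skips most columns here, the following minimal sketch uses a made-up two-column frame (the data is hypothetical, not taken from the dataset). Numbers stored as strings are left out by default and only show up with `include="all"`:

```python
import pandas as pd

# Hypothetical toy frame: 'tempo' holds numbers stored as strings.
toy = pd.DataFrame({
    "key": [0.0, 5.0, 11.0],
    "tempo": ["97.42", "104.053", "nao_sei"],
})

numeric_summary = toy.describe()            # numeric columns only
full_summary = toy.describe(include="all")  # also summarizes the string column

print(list(numeric_summary.columns))
print(list(full_summary.columns))
```

Once the string columns are converted to a numeric dtype, they appear in the default `describe()` output as well.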


## Removing Duplicates

In [None]:
## Check for duplicated rows
duplicados = df[df.duplicated()]
print(duplicados)

                               song_name  ... audio_valence
11777             I Love It (& Lil Pump)  ...         0.329
4301              I Love It (& Lil Pump)  ...         0.329
14444             I Love It (& Lil Pump)  ...         0.329
1229              I Love It (& Lil Pump)  ...         0.329
3443              I Love It (& Lil Pump)  ...         0.329
...                                  ...  ...           ...
14292  Get Dripped (feat. Playboi Carti)  ...         0.904
7273                       John Madden 2  ...         0.409
6514                        THIS OLE BOY  ...         0.764
14312    Transformer (feat. Nicki Minaj)  ...         0.287
7275                     Prince Charming  ...         0.605

[3903 rows x 15 columns]


In [None]:
## When needed, we can use subset to restrict the duplicate check to specific columns,
## e.g. when many values may legitimately repeat but the combination of two specific key columns must be unique.
print(df[df.duplicated(subset=['song_name','audio_valence'])])

                             song_name  ... audio_valence
11777           I Love It (& Lil Pump)  ...        0.3290
4301            I Love It (& Lil Pump)  ...        0.3290
14444           I Love It (& Lil Pump)  ...        0.3290
1229            I Love It (& Lil Pump)  ...        0.3290
3443            I Love It (& Lil Pump)  ...        0.3290
...                                ...  ...           ...
7273                     John Madden 2  ...        0.4090
6514                      THIS OLE BOY  ...        0.7640
14312  Transformer (feat. Nicki Minaj)  ...        0.2870
7275                   Prince Charming  ...        0.6050
7939                           99 Pace  ...        0.0689

[4161 rows x 15 columns]
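
The subset check flags more rows (4161) than the full-row check (3903) because rows only need to agree on the chosen columns. A small sketch on hypothetical toy data illustrates the difference, along with the `keep` parameter of `duplicated`:

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are identical; row 2 matches them on name and valence only.
toy = pd.DataFrame({
    "song_name": ["A", "A", "A", "B"],
    "audio_valence": [0.3, 0.3, 0.3, 0.9],
    "tempo": [100.0, 100.0, 120.0, 90.0],
})

full_dupes = toy.duplicated()                                       # compares every column
key_dupes = toy.duplicated(subset=["song_name", "audio_valence"])   # compares two columns only
all_copies = toy.duplicated(keep=False)                             # flags every member of a duplicate group

print(full_dupes.sum(), key_dupes.sum(), all_copies.sum())
```

By default the first occurrence is not flagged, which is why `keep=False` is useful when you want to inspect every copy before deciding what to drop.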


In [None]:
## We can remove these duplicated rows with drop_duplicates
## Doc - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
df.drop_duplicates(inplace=True) 
print(df.shape)
df.head(5)

(14932, 15)


Unnamed: 0,song_name,song_popularity,song_duration_ms,acousticness,danceability,energy,instrumentalness,key,liveness,loudness,audio_mode,speechiness,tempo,time_signature,audio_valence
1757,Party In The U.S.A.,nao_sei,0.8220000000000001kg,0.519mol/L,0.36,0.0,10.0,0.177,-8.575,0.0,0.105,97.42,4.0,0.7,
7574,I Love It (& Lil Pump),99,127946,0.0114kg,0.901mol/L,0.522,0.0,2.0,0.259,-8.304,1.0,0.33,104.053,4.0,0.329
17588,"Taki Taki (with Selena Gomez, Ozuna & Cardi B)",98,212500,0.153kg,0.841mol/L,0.7979999999999999,3.33e-06,1.0,0.0618,-4.206,0.0,0.229,95.948,4.0,0.591
17394,Promises (with Sam Smith),98,213309,0.0119kg,0.7809999999999999mol/L,0.768,4.91e-06,11.0,0.325,-5.9910000000000005,1.0,0.0394,123.07,4.0,0.486
12665,Eastside (with Halsey & Khalid),98,173799,0.555kg,0.56mol/L,0.68,0.0,6.0,0.116,-7.648,0.0,0.321,89.391,4.0,0.319


## Validating Consistency

As we saw earlier, some fields that should be numeric contain text, and that text does not even match the column name: here we find the units kg and mol/L.

In [None]:
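def remove_text(df, columns, text):
    # Remove the literal substring `text` (e.g. a unit suffix) from each column;
    # str.strip(text) would instead strip any of its characters from both ends.
    for col in columns:
        df[col] = df[col].str.replace(text, '', regex=False)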

In [None]:
remove_text(df, ['acousticness', 'danceability'], 'mol/L')
remove_text(df, ['song_duration_ms', 'acousticness'], 'kg')

In [None]:
df.head(5)

Unnamed: 0,song_name,song_popularity,song_duration_ms,acousticness,danceability,energy,instrumentalness,key,liveness,loudness,audio_mode,speechiness,tempo,time_signature,audio_valence
1757,Party In The U.S.A.,nao_sei,0.8220000000000001,0.519,0.36,0.0,10.0,0.177,-8.575,0.0,0.105,97.42,4.0,0.7,
7574,I Love It (& Lil Pump),99,127946.0,0.0114,0.901,0.522,0.0,2.0,0.259,-8.304,1.0,0.33,104.053,4.0,0.329
17588,"Taki Taki (with Selena Gomez, Ozuna & Cardi B)",98,212500.0,0.153,0.841,0.7979999999999999,3.33e-06,1.0,0.0618,-4.206,0.0,0.229,95.948,4.0,0.591
17394,Promises (with Sam Smith),98,213309.0,0.0119,0.7809999999999999,0.768,4.91e-06,11.0,0.325,-5.9910000000000005,1.0,0.0394,123.07,4.0,0.486
12665,Eastside (with Halsey & Khalid),98,173799.0,0.555,0.56,0.68,0.0,6.0,0.116,-7.648,0.0,0.321,89.391,4.0,0.319
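
One detail worth keeping in mind when removing unit suffixes: `Series.str.strip(chars)` treats its argument as a set of characters to trim from both ends, whereas `Series.str.replace(..., regex=False)` removes the literal substring. The sketch below uses hypothetical values to show where the two differ:

```python
import pandas as pd

# Hypothetical values: the last one ends in 'g' without carrying a 'kg' suffix.
values = pd.Series(["0.0114kg", "0.555kg", "12g"])

stripped = values.str.strip("kg")                      # trims any leading/trailing 'k' or 'g'
replaced = values.str.replace("kg", "", regex=False)   # removes only the literal 'kg'

print(stripped.tolist())
print(replaced.tolist())
```

For purely numeric strings with a unit suffix both approaches give the same result; they diverge only when the text happens to start or end with one of the stripped characters.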


## Data Type Conversions


In [None]:
## A column's type can be converted with .astype(), the pandas conversion method.
## Doc - https://pandas.pydata.org/docs/reference/api/pandas.Series.astype.html?highlight=astype#pandas.Series.astype
def to_type(df, columns, dtype):
    for col in columns:
        print(col)  # shows which column is being converted, useful when a conversion fails
        df[col] = df[col].astype(dtype)

numerical_cols = ['song_duration_ms', 'acousticness', 'danceability',
                  'energy', 'instrumentalness', 'liveness', 'loudness',
                  'speechiness', 'tempo', 'audio_valence']
 
categorical_cols = ['song_popularity', 'key', 'audio_mode', 'time_signature']

to_type(df, numerical_cols, 'float')
to_type(df, categorical_cols, 'category') 

song_duration_ms
acousticness
danceability
energy


ValueError: ignored

In [None]:
## With replace we can substitute the remaining inconsistent values
df = df.replace(['nao_sei'], np.nan)

In [None]:
to_type(df, numerical_cols, 'float')
to_type(df, categorical_cols, 'category') 

song_duration_ms
acousticness
danceability
energy
instrumentalness
liveness
loudness
speechiness


ValueError: ignored

In [None]:
df['speechiness'] = df['speechiness'].replace(['0.nao_sei'], np.nan)

In [None]:
to_type(df, numerical_cols, 'float')
to_type(df, categorical_cols, 'category') 

song_duration_ms
acousticness
danceability
energy
instrumentalness
liveness
loudness
speechiness


ValueError: ignored
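
Because `astype('float')` stops at the first value it cannot parse, hunting down each bad string one at a time is slow. One possible alternative, sketched here on made-up data, is `pd.to_numeric` with `errors="coerce"`: it turns unparseable entries into NaN, and comparing against the original column lists exactly which strings caused trouble. The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical column mixing valid numbers with stray text.
speechiness = pd.Series(["0.105", "0.33", "0.nao_sei", "abc"])

converted = pd.to_numeric(speechiness, errors="coerce")   # unparseable entries become NaN
bad_values = speechiness[converted.isna() & speechiness.notna()]

print(bad_values.tolist())
```

Inspecting `bad_values` first keeps the decision explicit: you can still choose between fixing, replacing or dropping each offending entry.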

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14932 entries, 1757 to 9956
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   song_name         14932 non-null  object 
 1   song_popularity   14931 non-null  object 
 2   song_duration_ms  14932 non-null  float64
 3   acousticness      14932 non-null  float64
 4   danceability      14932 non-null  float64
 5   energy            14931 non-null  float64
 6   instrumentalness  14930 non-null  float64
 7   key               14932 non-null  float64
 8   liveness          14928 non-null  float64
 9   loudness          14931 non-null  float64
 10  audio_mode        14931 non-null  object 
 11  speechiness       14932 non-null  object 
 12  tempo             14931 non-null  object 
 13  time_signature    14931 non-null  object 
 14  audio_valence     14931 non-null  float64
dtypes: float64(9), object(6)
memory usage: 1.8+ MB


In [None]:
## Validating categorical variables
## One way to validate them is to count the number of elements in each category.
for col in categorical_cols:
  print(f'{col}')
  print(df[col].value_counts().sort_values())

song_popularity
99       1
100      1
98       4
97       4
96       5
      ... 
51     324
53     325
55     345
58     347
52     355
Name: song_popularity, Length: 101, dtype: int64
key
0.177        1
3.000      433
10.000    1045
8.000     1047
6.000     1048
4.000     1084
11.000    1223
5.000     1257
2.000     1399
9.000     1410
1.000     1596
7.000     1654
0.000     1735
Name: key, dtype: int64
audio_mode
0.105       1
0        5496
1        9434
Name: audio_mode, dtype: int64
time_signature
2800000000        1
0.7               1
0                 3
1                67
5               195
3               684
4             13980
Name: time_signature, dtype: int64


In [None]:
## Above we can see odd values in each column, such as 0.177 for key and 0.105 for a binary variable like audio_mode
df['key'] = df['key'].replace([0.177], np.nan)
df['audio_mode'] = df['audio_mode'].replace(['0.105'], np.nan)
df['time_signature'] = df['time_signature'].replace(['0.7', '2800000000'], np.nan)
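
A complementary approach, shown here only as a sketch, is to declare the values a column is allowed to take and blank out everything else, instead of listing each bad value by hand. The allowed set `{"0", "1"}` and the sample data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical column; treating {"0", "1"} as the only valid flags is an assumption.
audio_mode = pd.Series(["0", "1", "0.105", "1"])
allowed = ["0", "1"]

# Keep values found in `allowed`; everything else becomes NaN.
cleaned = audio_mode.where(audio_mode.isin(allowed), np.nan)

print(cleaned.isna().sum())
```

This scales better when a column contains many different invalid values, at the cost of having to know the valid domain up front.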

From this point on we have a dataset with a minimum level of consistency and no duplicated rows.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14932 entries, 1757 to 9956
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   song_name         14932 non-null  object 
 1   song_popularity   14931 non-null  object 
 2   song_duration_ms  14932 non-null  float64
 3   acousticness      14932 non-null  float64
 4   danceability      14932 non-null  float64
 5   energy            14931 non-null  float64
 6   instrumentalness  14930 non-null  float64
 7   key               14931 non-null  float64
 8   liveness          14928 non-null  float64
 9   loudness          14931 non-null  float64
 10  audio_mode        14930 non-null  object 
 11  speechiness       14932 non-null  object 
 12  tempo             14931 non-null  object 
 13  time_signature    14929 non-null  object 
 14  audio_valence     14931 non-null  float64
dtypes: float64(9), object(6)
memory usage: 1.8+ MB


In [None]:
## Explicitly counting the null values in each column
df.isna().sum()

song_name           0
song_popularity     1
song_duration_ms    0
acousticness        0
danceability        0
energy              1
instrumentalness    2
key                 1
liveness            4
loudness            1
audio_mode          2
speechiness         0
tempo               1
time_signature      3
audio_valence       1
dtype: int64

In [None]:
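## We can also check for negative values that make no sense for the problem, such as a negative song duration.
## Some columns are still stored as object (e.g. speechiness), so they are coerced to numeric before the comparison.
(df[numerical_cols].apply(pd.to_numeric, errors='coerce') < 0).sum()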

## Removing Columns

Some columns may be unnecessary for our analysis, either because they carry no relevant information about what we want to find out, or because they have so many missing values that they hurt more than they help. In those cases, a quick and easy solution is to drop them.

Here we drop just one, as an experiment.

In [None]:
## We can drop by passing the labels and the axis, where axis=0 means rows and axis=1 means columns.
## Without inplace=True or reassignment this returns a new DataFrame and leaves df unchanged.
## Doc - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html
df.drop(['liveness'], axis=1)

Unnamed: 0,song_name,song_popularity,song_duration_ms,acousticness,danceability,energy,instrumentalness,key,loudness,audio_mode,speechiness,tempo,time_signature,audio_valence
1757,Party In The U.S.A.,,0.822,0.51900,0.360,0.000,10.000000,,0.000,,97.42,4,,
7574,I Love It (& Lil Pump),99,127946.000,0.01140,0.901,0.522,0.000000,2.0,-8.304,1,0.33,104.053,4,0.329
17588,"Taki Taki (with Selena Gomez, Ozuna & Cardi B)",98,212500.000,0.15300,0.841,0.798,0.000003,1.0,-4.206,0,0.229,95.948,4,0.591
17394,Promises (with Sam Smith),98,213309.000,0.01190,0.781,0.768,0.000005,11.0,-5.991,1,0.0394,123.07,4,0.486
12665,Eastside (with Halsey & Khalid),98,173799.000,0.55500,0.560,0.680,0.000000,6.0,-7.648,0,0.321,89.391,4,0.319
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11278,María,0,161986.000,0.90600,0.843,0.483,0.005230,3.0,-14.776,1,0.0638,141.295,4,0.964
12923,Unfuck The World,0,250213.000,0.00142,0.574,0.831,0.010800,7.0,-5.576,0,0.0325,101.988,4,0.518
11282,Kimbya (feat. Manny Roman),0,261590.000,0.49600,0.418,0.958,0.058300,7.0,-5.678,1,0.0728,123.639,4,0.676
12905,Mad World,0,174253.000,0.00002,0.298,0.931,0.404000,2.0,-6.185,1,0.13,135.97,4,0.404


In [None]:
## Or we can drop directly by passing the columns parameter
df.drop(columns=['liveness'], inplace=True)
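
The two cells above behave differently: the first only displays a copy without the column, while the second actually changes `df`. A minimal sketch on a hypothetical two-column frame makes the distinction visible:

```python
import pandas as pd

# Hypothetical toy frame.
toy = pd.DataFrame({"liveness": [0.1, 0.2], "tempo": [97.4, 104.1]})

preview = toy.drop(columns=["liveness"])   # returns a new DataFrame
print(list(toy.columns))                   # the original still has both columns

toy = toy.drop(columns=["liveness"])       # reassigning makes the change stick
print(list(toy.columns))
```

Reassignment and `inplace=True` end up with the same result; reassignment is often preferred because it chains more naturally with other operations.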

## Missing Values

In some situations our df may contain a lot of incomplete information. These missing values can harm the analysis and the later steps that depend on it, such as pre-processing, so we need to either remove them or replace them with other values. The flowchart below can help with that decision and suggests how to handle each case.

![alt text](https://res.cloudinary.com/dyd911kmh/image/upload/v1633673400/handling-missing-values-diagram_xr4ryx.png "Missing-data handling flow https://www.datacamp.com/community/tutorials/data-preparation-with-pandas")

For data that is not a time series, the first option is to replace missing values with the column mean. However, the mean may be skewed by outliers in the column, so we can also use the mode or the median instead.

We can do this with the .fillna function, which fills every missing entry in the selection. Let's write a couple of loops as examples.
The first one goes through a few columns and replaces their missing values with the mode:

In [None]:
df.isna().sum()

song_name           0
song_popularity     1
song_duration_ms    0
acousticness        0
danceability        0
energy              1
instrumentalness    2
key                 1
loudness            1
audio_mode          2
speechiness         0
tempo               1
time_signature      3
audio_valence       1
dtype: int64

In [None]:
## Fill with the mode
## Doc .fillna - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html

for column in ['acousticness', 'speechiness']:
    # Assign the result back instead of using inplace=True on a column selection
    df[column] = df[column].fillna(df[column].mode()[0])

In [None]:
## Fill with the median
for column in ['song_duration_ms',  'danceability', 'energy', 
                'loudness', 'audio_valence']:
    df[column] = df[column].fillna(df[column].median())
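
To illustrate why the median can be a safer fill value than the mean when outliers are present, here is a small sketch with hypothetical durations, one of which is an extreme outlier:

```python
import numpy as np
import pandas as pd

# Hypothetical durations in ms with one extreme outlier and one missing value.
duration = pd.Series([200000.0, 210000.0, 220000.0, 5000000.0, np.nan])

mean_fill = duration.fillna(duration.mean())       # pulled upward by the outlier
median_fill = duration.fillna(duration.median())   # stays close to the typical values

print(mean_fill.iloc[-1], median_fill.iloc[-1])
```

The mean-filled value (1407500.0) is far from any typical song in this toy sample, while the median-filled value (215000.0) is not.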

In [None]:
df.isna().sum()

song_name           0
song_popularity     1
song_duration_ms    0
acousticness        0
danceability        0
energy              0
instrumentalness    2
key                 1
loudness            0
audio_mode          2
speechiness         0
tempo               1
time_signature      3
audio_valence       0
dtype: int64

In [None]:
## To simply drop the rows that still have missing values we can use dropna
## Doc - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html?highlight=dropna#pandas.DataFrame.dropna
df.dropna(inplace=True)
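
Called without arguments, `dropna` removes every row that has a missing value in any column. When only some columns are essential, the `subset` parameter limits the check to them. A sketch on hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing values in two different columns.
toy = pd.DataFrame({
    "song_name": ["A", "B", "C"],
    "key": [1.0, np.nan, 7.0],
    "tempo": [97.4, 104.1, np.nan],
})

any_missing = toy.dropna()                # drops rows B and C
key_only = toy.dropna(subset=["key"])     # drops only row B

print(len(any_missing), len(key_only))
```

Restricting the check this way preserves more rows, which matters when the missing values are spread across columns that the analysis does not depend on.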

In [None]:
df.isna().sum()

song_name           0
song_popularity     0
song_duration_ms    0
acousticness        0
danceability        0
energy              0
instrumentalness    0
key                 0
loudness            0
audio_mode          0
speechiness         0
tempo               0
time_signature      0
audio_valence       0
dtype: int64

## Conclusion

At the end we have a dataset ready for exploratory analysis. We have not handled outliers yet because, depending on the scenario, the outliers themselves may be useful; that decision is left for the next step.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14925 entries, 7574 to 9956
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   song_name         14925 non-null  object 
 1   song_popularity   14925 non-null  object 
 2   song_duration_ms  14925 non-null  float64
 3   acousticness      14925 non-null  float64
 4   danceability      14925 non-null  float64
 5   energy            14925 non-null  float64
 6   instrumentalness  14925 non-null  float64
 7   key               14925 non-null  float64
 8   loudness          14925 non-null  float64
 9   audio_mode        14925 non-null  object 
 10  speechiness       14925 non-null  object 
 11  tempo             14925 non-null  object 
 12  time_signature    14925 non-null  object 
 13  audio_valence     14925 non-null  float64
dtypes: float64(8), object(6)
memory usage: 1.7+ MB


In [None]:
df.head()

Unnamed: 0,song_name,song_popularity,song_duration_ms,acousticness,danceability,energy,instrumentalness,key,loudness,audio_mode,speechiness,tempo,time_signature,audio_valence
7574,I Love It (& Lil Pump),99,127946.0,0.0114,0.901,0.522,0.0,2.0,-8.304,1,0.33,104.053,4,0.329
17588,"Taki Taki (with Selena Gomez, Ozuna & Cardi B)",98,212500.0,0.153,0.841,0.798,3e-06,1.0,-4.206,0,0.229,95.948,4,0.591
17394,Promises (with Sam Smith),98,213309.0,0.0119,0.781,0.768,5e-06,11.0,-5.991,1,0.0394,123.07,4,0.486
12665,Eastside (with Halsey & Khalid),98,173799.0,0.555,0.56,0.68,0.0,6.0,-7.648,0,0.321,89.391,4,0.319
17618,In My Feelings,98,217925.0,0.0589,0.835,0.626,6e-05,1.0,-5.833,1,0.125,91.03,4,0.35


# Challenge: Airbnb Dataset

As a challenge, you should carry out the full preparation and cleaning of the following Airbnb dataset.

**Dataset**

Since 2008, guests and hosts have used Airbnb to expand travel possibilities and offer a more unique, personalized way of experiencing the world. This dataset describes listing activity and metrics in NYC, NY for 2019.

This data file includes all the information needed to find out more about hosts, geographic availability, and the metrics required to make predictions and draw conclusions.

In [4]:
airbnb_url = 'https://raw.githubusercontent.com/ManarOmar/New-York-Airbnb-2019/master/AB_NYC_2019.csv'
airbnb_ori = pd.read_csv(airbnb_url)
airbnb = airbnb_ori.copy()
wrong_spelling = ['manhatann', 'brookln', 'Blookn', 'Quinns','Broonx'] * 10
random_index = airbnb.sample(50, random_state = 10).index
airbnb.loc[random_index,'neighbourhood_group'] = wrong_spelling
airbnb.loc[random_index,'name'] = 'nan'
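
As a hint for the challenge, the misspellings injected above can be normalized with a dictionary passed to `replace`. The mapping below is an assumption about which borough each misspelling was meant to be, and the sample Series is hypothetical rather than the real column:

```python
import pandas as pd

# Assumed mapping from each injected misspelling to its intended borough.
spelling_fix = {
    "manhatann": "Manhattan",
    "brookln": "Brooklyn",
    "Blookn": "Brooklyn",
    "Quinns": "Queens",
    "Broonx": "Bronx",
}

# Hypothetical sample standing in for the neighbourhood_group column.
groups = pd.Series(["Manhattan", "manhatann", "Quinns", "Broonx", "Brooklyn"])
fixed = groups.replace(spelling_fix)

print(sorted(fixed.unique()))
```

Running `value_counts()` on the real column first is a good way to confirm which spellings actually occur before building the mapping.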

In [7]:
df = pd.read_csv('AB_NYC_2019.csv', index_col=0)

In [8]:
## Put your code below:
df.sort_values('number_of_reviews', ascending=False, inplace=True) 

In [9]:
print(df.shape)

(21908, 15)


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21908 entries, 9145202 to 17610107
Data columns (total 15 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   name                            21893 non-null  object 
 1   host_id                         21908 non-null  int64  
 2   host_name                       21894 non-null  object 
 3   neighbourhood_group             21908 non-null  object 
 4   neighbourhood                   21908 non-null  object 
 5   latitude                        21908 non-null  float64
 6   longitude                       21908 non-null  float64
 7   room_type                       21907 non-null  object 
 8   price                           21907 non-null  float64
 9   minimum_nights                  21907 non-null  float64
 10  number_of_reviews               21907 non-null  float64
 11  last_review                     18227 non-null  object 
 12  reviews_per_month      

In [11]:
df.describe()

Unnamed: 0,host_id,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
count,21908.0,21908.0,21908.0,21907.0,21907.0,21907.0,18227.0,21907.0,21907.0
mean,24241600.0,40.729604,-73.956968,151.386041,7.417949,35.79381,0.940766,3.734377,103.542658
std,26746500.0,0.053034,0.039019,235.825096,23.923376,58.921168,1.293812,12.981111,132.702015
min,2571.0,40.49979,-74.24285,10.0,1.0,0.0,0.01,1.0,0.0
25%,3948748.0,40.68982,-73.983283,73.0,2.0,1.0,0.1,1.0,0.0
50%,14137200.0,40.72341,-73.95807,110.0,3.0,9.0,0.36,1.0,9.0
75%,36360020.0,40.763955,-73.941558,175.0,5.0,45.0,1.3,2.0,222.0
max,119706600.0,40.90804,-73.71299,10000.0,1250.0,629.0,16.22,121.0,365.0


In [12]:
duplicados = df[df.duplicated()]
print(duplicados)

Empty DataFrame
Columns: [name, host_id, host_name, neighbourhood_group, neighbourhood, latitude, longitude, room_type, price, minimum_nights, number_of_reviews, last_review, reviews_per_month, calculated_host_listings_count, availability_365]
Index: []
