# Data preprocessing routine

[Dataset Search](https://datasetsearch.research.google.com/)

## Creating and analyzing a DataFrame

In [74]:
import pandas as pd

In [75]:
df_meteorites = pd.read_csv('files/Meteorite_Landings.csv')
df_meteorites.sample(4)

Unnamed: 0,name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
11964,Frontier Mountain 03029,30587,Valid,H5,26.1,Found,01/01/2003 12:00:00 AM,-73.0,160.5,"(-72.95275, 160.49833)"
34581,Queen Alexandra Range 97947,21401,Valid,L5,51.6,Found,01/01/1997 12:00:00 AM,-84.0,168.0,"(-84.0, 168.0)"
21580,Lewis Cliff 93871,14593,Valid,H6,15.0,Found,01/01/1993 12:00:00 AM,-84.3,161.4,"(-84.26544, 161.40627)"
38918,Yamato 74490,24868,Valid,H6,11.5,Found,01/01/1974 12:00:00 AM,0.0,35.7,"(0.0, 35.66667)"


In [76]:
df_meteorites.shape

(45716, 10)

In [77]:
pd.options.display.float_format = '{:_.1f}'.format
df_meteorites.describe()

Unnamed: 0,id,mass (g),reclat,reclong
count,45_716.0,45_585.0,38_401.0,38_401.0
mean,26_889.7,13_278.1,-39.1,61.1
std,16_860.7,574_988.9,46.4,80.6
min,1.0,0.0,-87.4,-165.4
25%,12_688.8,7.2,-76.7,0.0
50%,24_261.5,32.6,-71.5,35.7
75%,40_656.8,202.6,0.0,157.2
max,57_458.0,60_000_000.0,81.2,354.5
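The `'{:_.1f}'` format string used above is a standard Python format spec: `_` asks for an underscore as the thousands separator and `.1f` for one decimal place. The following is a minimal sketch of its effect; the `small` DataFrame is a made-up example, and `pd.option_context` is used here only so that the global display setting is left untouched.

```python
import pandas as pd

# The same callable that the notebook assigns to pd.options.display.float_format.
spec = '{:_.1f}'.format

print(spec(574988.9))   # underscore as thousands separator, one decimal
print(spec(60000000))   # integers are formatted as floats too

# Hypothetical two-row frame, only to show the effect on rendering.
small = pd.DataFrame({"mass": [13278.1, 60000000.0]})

# option_context applies the setting temporarily instead of globally.
with pd.option_context("display.float_format", spec):
    rendered = small.to_string()

print(rendered)
```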


In [78]:
df_meteorites.describe(include='all')

Unnamed: 0,name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
count,45716,45_716.0,45716,45716,45_585.0,45716,45425,38_401.0,38_401.0,38401
unique,45716,,2,466,,2,266,,,17100
top,Aachen,,Valid,L6,,Found,01/01/2003 12:00:00 AM,,,"(0.0, 0.0)"
freq,1,,45641,8285,,44609,3323,,,6214
mean,,26_889.7,,,13_278.1,,,-39.1,61.1,
std,,16_860.7,,,574_988.9,,,46.4,80.6,
min,,1.0,,,0.0,,,-87.4,-165.4,
25%,,12_688.8,,,7.2,,,-76.7,0.0,
50%,,24_261.5,,,32.6,,,-71.5,35.7,
75%,,40_656.8,,,202.6,,,0.0,157.2,


In [79]:
df_meteorites.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45716 entries, 0 to 45715
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   name         45716 non-null  object 
 1   id           45716 non-null  int64  
 2   nametype     45716 non-null  object 
 3   recclass     45716 non-null  object 
 4   mass (g)     45585 non-null  float64
 5   fall         45716 non-null  object 
 6   year         45425 non-null  object 
 7   reclat       38401 non-null  float64
 8   reclong      38401 non-null  float64
 9   GeoLocation  38401 non-null  object 
dtypes: float64(3), int64(1), object(6)
memory usage: 3.5+ MB


## Variable types in the DataFrame

In [80]:
df_meteorites.dtypes

name            object
id               int64
nametype        object
recclass        object
mass (g)       float64
fall            object
year            object
reclat         float64
reclong        float64
GeoLocation     object
dtype: object

`convert_dtypes()` is a relatively new pandas feature that infers the most suitable dtype for each column and converts the column to it.

Unlike `int64` (lowercase), the nullable `Int64` dtype can represent missing values (NaN).

In [81]:
df_meteorites = df_meteorites.convert_dtypes()
df_meteorites.dtypes

name            string
id               Int64
nametype        string
recclass        string
mass (g)       Float64
fall            string
year            string
reclat         Float64
reclong        Float64
GeoLocation     string
dtype: object
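The difference between `int64` and the nullable `Int64` can be seen on a small series. This is a sketch with made-up values, not part of the meteorite data: a plain integer column with a missing value is upcast to `float64`, while `Int64` keeps the integers and stores the gap as `pd.NA`.

```python
import numpy as np
import pandas as pd

# NumPy's int64 cannot hold a missing value, so pandas falls back to float64.
plain = pd.Series([1, 2, np.nan])
print(plain.dtype)

# The nullable extension dtype "Int64" (capital I) keeps integers and marks the gap as <NA>.
nullable = pd.Series([1, 2, None], dtype="Int64")
print(nullable.dtype)
print(nullable.isna().sum())

# convert_dtypes() chooses the nullable integer dtype when all floats are whole numbers.
converted = plain.convert_dtypes()
print(converted.dtype)
```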

Count of distinct values in each column:

In [82]:
df_meteorites.nunique()

name           45716
id             45716
nametype           2
recclass         466
mass (g)       12576
fall               2
year             266
reclat         12738
reclong        14640
GeoLocation    17100
dtype: int64
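Columns with very few distinct values, such as *nametype* and *fall* above, are natural candidates for a categorical dtype. A minimal sketch of picking them programmatically follows; the `toy` DataFrame and the `max_categories` threshold are assumptions made for illustration, not values taken from the notebook.

```python
import pandas as pd

# Hypothetical miniature of the dataset.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "fall": ["Fell", "Found", "Found", "Found"],
    "nametype": ["Valid", "Valid", "Relict", "Valid"],
})

# Arbitrary cut-off chosen for this example; tune it for real data.
max_categories = 2

# nunique() returns one count per column; keep the low-cardinality ones.
counts = toy.nunique()
candidates = counts[counts <= max_categories].index.tolist()
print(candidates)
```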

## Categorization

*nametype* and *fall* are categorical variables.

Let's convert these columns to the *category* dtype.

In [83]:
df_meteorites[['nametype','fall']]

Unnamed: 0,nametype,fall
0,Valid,Fell
1,Valid,Fell
2,Valid,Fell
3,Valid,Fell
4,Valid,Fell
...,...,...
45711,Valid,Found
45712,Valid,Found
45713,Valid,Found
45714,Valid,Found


In [84]:
df_meteorites[['nametype','fall']] = df_meteorites[['nametype','fall']].astype('category')
df_meteorites.dtypes

name             string
id                Int64
nametype       category
recclass         string
mass (g)        Float64
fall           category
year             string
reclat          Float64
reclong         Float64
GeoLocation      string
dtype: object

Categories present in the *fall* column:

In [85]:
df_meteorites['fall'].unique()

['Fell', 'Found']
Categories (2, object): ['Fell', 'Found']

How many times each category appears:

In [86]:
df_meteorites['fall'].value_counts()

Found    44609
Fell      1107
Name: fall, dtype: int64

In [87]:
pd.get_dummies(df_meteorites['fall'])

Unnamed: 0,Fell,Found
0,1,0
1,1,0
2,1,0
3,1,0
4,1,0
...,...,...
45711,0,1
45712,0,1
45713,0,1
45714,0,1


With `get_dummies` we encode the *fall* column as two indicator columns (*fell* and *found*), each holding 1 where the row belongs to that category and 0 otherwise. This is the mapping of categorical variables that we saw in the probability course.

In [88]:
df_meteorites[['fell','found']] = pd.get_dummies(df_meteorites['fall'])
df_meteorites.head(4)

Unnamed: 0,name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation,fell,found
0,Aachen,1,Valid,L5,21.0,Fell,01/01/1880 12:00:00 AM,50.8,6.1,"(50.775, 6.08333)",1,0
1,Aarhus,2,Valid,H6,720.0,Fell,01/01/1951 12:00:00 AM,56.2,10.2,"(56.18333, 10.23333)",1,0
2,Abee,6,Valid,EH4,107_000.0,Fell,01/01/1952 12:00:00 AM,54.2,-113.0,"(54.21667, -113.0)",1,0
3,Acapulco,10,Valid,Acapulcoite,1_914.0,Fell,01/01/1976 12:00:00 AM,16.9,-99.9,"(16.88333, -99.9)",1,0
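The outputs above show 0/1 indicator columns. In pandas 2.0 and later, `get_dummies` returns boolean columns by default, so an explicit `dtype` is needed to get the same result. The sketch below uses a made-up three-row frame and `join` instead of assigning by position; it is one possible way to do it, not the notebook's own method.

```python
import pandas as pd

# Hypothetical miniature of the 'fall' column.
toy = pd.DataFrame({"fall": ["Fell", "Found", "Found"]})
toy["fall"] = toy["fall"].astype("category")

# Request integer indicators explicitly so the result does not depend on the pandas version.
dummies = pd.get_dummies(toy["fall"], dtype="uint8")

# Lowercase the new column names to match 'fell' and 'found'.
dummies.columns = [str(c).lower() for c in dummies.columns]

# join aligns on the index, so each row keeps its own indicators.
toy = toy.join(dummies)
print(toy)
```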


Now let's look at the *year* column, which is stored as strings and can be converted to a datetime type.

In [89]:
df_meteorites['year']

0        01/01/1880 12:00:00 AM
1        01/01/1951 12:00:00 AM
2        01/01/1952 12:00:00 AM
3        01/01/1976 12:00:00 AM
4        01/01/1902 12:00:00 AM
                  ...          
45711    01/01/1990 12:00:00 AM
45712    01/01/1999 12:00:00 AM
45713    01/01/1939 12:00:00 AM
45714    01/01/2003 12:00:00 AM
45715    01/01/1976 12:00:00 AM
Name: year, Length: 45716, dtype: string

In [90]:
pd.to_datetime(
    df_meteorites['year'], 
    errors='coerce',
    format='%m/%d/%Y %H:%M:%S %p'
)

0       1880-01-01 12:00:00
1       1951-01-01 12:00:00
2       1952-01-01 12:00:00
3       1976-01-01 12:00:00
4       1902-01-01 12:00:00
                ...        
45711   1990-01-01 12:00:00
45712   1999-01-01 12:00:00
45713   1939-01-01 12:00:00
45714   2003-01-01 12:00:00
45715   1976-01-01 12:00:00
Name: year, Length: 45716, dtype: datetime64[ns]
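Note that the output above reads `12:00:00` (noon) although the raw strings say `12:00:00 AM` (midnight). In `strptime`-style formats the `%p` directive only affects the hour when it is combined with the 12-hour directive `%I`; with `%H` the AM/PM marker is parsed but ignored. A minimal sketch of the difference, using a few made-up rows in the same string format:

```python
import pandas as pd

# Sample strings in the same layout as the 'year' column, plus one missing value.
raw = pd.Series(["01/01/1880 12:00:00 AM", "01/01/1951 12:00:00 AM", None])

# 24-hour directive: the hour is read literally and AM/PM has no effect.
with_H = pd.to_datetime(raw, errors="coerce", format="%m/%d/%Y %H:%M:%S %p")

# 12-hour directive: AM/PM is honored, so 12:00:00 AM becomes midnight.
with_I = pd.to_datetime(raw, errors="coerce", format="%m/%d/%Y %I:%M:%S %p")

print(with_H.dt.hour.tolist())
print(with_I.dt.hour.tolist())
```

With `errors='coerce'`, values that cannot be parsed or that fall outside the range supported by `datetime64[ns]` become `NaT` instead of raising an error.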

In [91]:
df_meteorites['year'] = pd.to_datetime(
    df_meteorites['year'], 
    errors='coerce',
    format='%m/%d/%Y %H:%M:%S %p'
)
df_meteorites.dtypes

name                   string
id                      Int64
nametype             category
recclass               string
mass (g)              Float64
fall                 category
year           datetime64[ns]
reclat                Float64
reclong               Float64
GeoLocation            string
fell                    uint8
found                   uint8
dtype: object

A categorical column stores each distinct value once and refers to it through integer codes, so categorization reduces RAM usage and storage.
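The memory effect can be checked with `memory_usage(deep=True)`. The sketch below uses a synthetic column of 10,000 repeated labels rather than the real data, so the byte counts it prints are only illustrative.

```python
import pandas as pd

# Synthetic low-cardinality column stored as generic Python objects.
as_object = pd.Series(["Found"] * 9_000 + ["Fell"] * 1_000, dtype="object")

# Same values stored as a categorical: two labels plus small integer codes.
as_category = as_object.astype("category")

# deep=True counts the memory of the Python strings themselves.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(object_bytes, category_bytes)
```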

## Renaming columns

In [94]:
df_meteorites.rename(columns={'mass (g)':'mass'}, inplace=True)
df_meteorites.head(4)

Unnamed: 0,name,id,nametype,recclass,mass,fall,year,reclat,reclong,GeoLocation,fell,found
0,Aachen,1,Valid,L5,21.0,Fell,1880-01-01 12:00:00,50.8,6.1,"(50.775, 6.08333)",1,0
1,Aarhus,2,Valid,H6,720.0,Fell,1951-01-01 12:00:00,56.2,10.2,"(56.18333, 10.23333)",1,0
2,Abee,6,Valid,EH4,107_000.0,Fell,1952-01-01 12:00:00,54.2,-113.0,"(54.21667, -113.0)",1,0
3,Acapulco,10,Valid,Acapulcoite,1_914.0,Fell,1976-01-01 12:00:00,16.9,-99.9,"(16.88333, -99.9)",1,0
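`rename` can also return a new DataFrame instead of modifying the existing one, and it accepts several columns in the same mapping. A small sketch on a made-up frame follows; the target name `geolocation` is hypothetical and only shows a second entry in the mapping.

```python
import pandas as pd

# Hypothetical two-row frame with two of the original column names.
toy = pd.DataFrame({
    "mass (g)": [21.0, 720.0],
    "GeoLocation": ["(50.775, 6.08333)", "(56.18333, 10.23333)"],
})

# Without inplace=True, rename leaves 'toy' untouched and returns a renamed copy.
renamed = toy.rename(columns={"mass (g)": "mass", "GeoLocation": "geolocation"})

print(list(toy.columns))
print(list(renamed.columns))
```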


In [96]:
list(df_meteorites)

['name',
 'id',
 'nametype',
 'recclass',
 'mass',
 'fall',
 'year',
 'reclat',
 'reclong',
 'GeoLocation',
 'fell',
 'found']

In [97]:
df_meteorites.columns

Index(['name', 'id', 'nametype', 'recclass', 'mass', 'fall', 'year', 'reclat',
       'reclong', 'GeoLocation', 'fell', 'found'],
      dtype='object')