# EDA Exam

Forest Covertype data (https://archive.ics.uci.edu/ml/datasets/Covertype) is a dataset available through the sklearn library that is well suited to a classification exercise. The goal is to study the cartographic variables in order to predict the forest cover type. The actual forest cover type for each observation (a 30 x 30 meter cell) was determined from US Forest Service (USFS) data.

The data are in raw form (unscaled) and contain binary columns (0 or 1) for the qualitative independent variables (wilderness areas and soil types).

The study areas are forests with minimal human-caused disturbance, so the existing forest cover types are more the result of ecological processes than of forest management practices.

#### Import the libraries

In [56]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import MinMaxScaler


#### Read the dataset

In [85]:
print("Loading the dataset...")
df = pd.read_csv('covtype.data', header=None)
print("Dataset loaded.\n")

Loading the dataset...
Dataset loaded.



#### Assign column names

In [87]:
print("Assigning column names...")
# The 10 quantitative cartographic variables
basic_cols = ['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology',
              'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways',
              'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',
              'Horizontal_Distance_To_Fire_Points']

# 4 binary wilderness-area indicators and 40 binary soil-type indicators
wilderness_cols = ['Wilderness_Area_' + str(i) for i in range(1, 5)]
soil_cols = ['Soil_Type_' + str(i) for i in range(1, 41)]

all_cols = basic_cols + wilderness_cols + soil_cols + ['Cover_Type']
df.columns = all_cols
print("Column names assigned.\n")

Assigning column names...
Column names assigned.



#### Preprocess the Wilderness_Area and Soil_Type columns

In [88]:
print("Preprocessing the Wilderness_Area and Soil_Type columns...")
# Collapse each group of binary indicator columns into a single column
# using the row-wise maximum. Note: this records only whether any
# indicator is set, not which one.
df['Wilderness_Area'] = df[wilderness_cols].max(axis=1)
df.drop(wilderness_cols, axis=1, inplace=True)

df['Soil_Type'] = df[soil_cols].max(axis=1)
df.drop(soil_cols, axis=1, inplace=True)
print("Preprocessing completed.\n")

Preprocessing the Wilderness_Area and Soil_Type columns...
Preprocessing completed.
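
The row-wise maximum of one-hot indicator columns is 1 whenever any indicator is set, so it cannot distinguish one category from another. As a minimal sketch on a toy frame (the values below are made up for illustration and are not taken from covtype.data), `idxmax(axis=1)` is one way to recover which indicator is active in each row:

```python
import pandas as pd

# Toy one-hot block: each row has exactly one 1 among the indicator columns.
toy = pd.DataFrame({
    "Wilderness_Area_1": [1, 0, 0],
    "Wilderness_Area_2": [0, 0, 1],
    "Wilderness_Area_3": [0, 1, 0],
})

# max(axis=1) only reports that some indicator is set, so it is constant here.
collapsed_max = toy.max(axis=1)

# idxmax(axis=1) returns the name of the column holding the 1;
# stripping the prefix leaves the category code as an integer.
decoded = (
    toy.idxmax(axis=1)
    .str.replace("Wilderness_Area_", "", regex=False)
    .astype(int)
)

print(collapsed_max.tolist())  # [1, 1, 1]
print(decoded.tolist())        # [1, 3, 2]
```

This decoding assumes numeric indicator columns with exactly one active indicator per row. If a row could be all zeros, `idxmax` would still return the first column, so that case would need separate handling.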



#### Inspect the contents of our DataFrame

In [90]:
print("DataFrame information:")
df.info()  # info() prints its report directly and returns None

DataFrame information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 581013 entries, 0 to 581012
Data columns (total 13 columns):
 #   Column                              Non-Null Count   Dtype 
---  ------                              --------------   ----- 
 0   Elevation                           581013 non-null  object
 1   Aspect                              581013 non-null  object
 2   Slope                               581013 non-null  object
 3   Horizontal_Distance_To_Hydrology    581013 non-null  object
 4   Vertical_Distance_To_Hydrology      581013 non-null  object
 5   Horizontal_Distance_To_Roadways     581013 non-null  object
 6   Hillshade_9am                       581013 non-null  object
 7   Hillshade_Noon                      581013 non-null  object
 8   Hillshade_3pm                       581013 non-null  object
 9   Horizontal_Distance_To_Fire_Points  581013 non-null  object
 10  Cover_Type                          581013 non-null  object
 11  Wilderness_A
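
Every column listed above is reported as object, and the index has 581013 entries, one more than the 581012 observations in the UCI description of this dataset. One possible explanation, which the output alone cannot confirm, is that the file begins with a header line that `header=None` reads as a data row. The toy CSV below (hypothetical content, not the real file) sketches that effect:

```python
import io
import pandas as pd

# Toy CSV with a header line (hypothetical stand-in for the real file).
raw = "Elevation,Aspect,Cover_Type\n2596,51,5\n2590,56,5\n2804,139,2\n"

# header=None treats the header line as data: one extra row, and every
# column holds strings instead of numbers.
as_data = pd.read_csv(io.StringIO(raw), header=None)

# header=0 consumes the first line as column names and keeps numeric dtypes.
as_names = pd.read_csv(io.StringIO(raw), header=0)

print(as_data.shape, as_data.dtypes.tolist())
print(as_names.shape, as_names.dtypes.tolist())
```

If the real file does contain such a line, reading it with `header=0` (or `skiprows=1` together with `header=None`) would be one way to keep the numeric dtypes.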

#### Check for null values

In [61]:
print("\nChecking for null values...")
nulos = df.isna().sum()
print("Number of null values per column:\n", nulos)

Elevation                             0
Aspect                                0
Slope                                 0
Horizontal_Distance_To_Hydrology      0
Vertical_Distance_To_Hydrology        0
Horizontal_Distance_To_Roadways       0
Hillshade_9am                         0
Hillshade_Noon                        0
Hillshade_3pm                         0
Horizontal_Distance_To_Fire_Points    0
Cover_Type                            0
Wilderness_Area                       0
Soil_Type                             0
dtype: int64

#### Check for duplicate rows

In [91]:
print("\nChecking for duplicate rows...")
duplicate_count = df.duplicated().sum()
if duplicate_count > 0:
    print(f'Total duplicate rows: {duplicate_count}')
else:
    print('No duplicate rows found.')


Checking for duplicate rows...
No duplicate rows found.


### Exercise 1

To obtain a dataset with reduced dimensionality, apply tree-based feature selection using each variable's importance (Decision Tree Importances):

Filter the table to keep only the variables that together account for up to 95% of the information needed to estimate the target variable.
Use random_state=100.

In [63]:
df_f1 = df.copy()

In [74]:
target = 'Cover_Type'
features = [x for x in df_f1.columns if x!=target]
arbol = DecisionTreeRegressor(random_state=100)

In [75]:
df_f1 = df.copy()
df_f1['Cover_Type'] = pd.factorize(df_f1['Cover_Type'])[0]
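
`pd.factorize` assigns integer codes in order of first appearance, so the resulting codes need not match the original class labels. A minimal sketch with made-up labels (not read from the dataset):

```python
import pandas as pd

# Hypothetical label column: the values are illustrative only.
labels = pd.Series([5, 5, 2, 1, 2])

codes, uniques = pd.factorize(labels)

print(codes.tolist())    # [0, 0, 1, 2, 1]
print(list(uniques))     # [5, 2, 1]

# The original label for each code can be recovered through `uniques`.
recovered = [uniques[c] for c in codes]
```

Keeping `uniques` around makes it possible to map the codes back to the original Cover_Type values later on.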


In [77]:
for col in features:
    df_f1[col] = pd.to_numeric(df_f1[col], errors='coerce')

df_f1.fillna(0, inplace=True)

arbol.fit(X=df_f1[features], y=df_f1[target])
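
The cell above fits a DecisionTreeRegressor on the integer codes of Cover_Type. Since the target is categorical, a DecisionTreeClassifier is a possible alternative. The sketch below uses synthetic data from `make_classification` as a stand-in (the feature names and sizes are assumptions, not the real dataset) to show that `feature_importances_` can be ranked in the same way:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features (assumption: 6 features, 3 classes).
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=100,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# Fit a classification tree with the same random_state used in the exercise.
clf = DecisionTreeClassifier(random_state=100)
clf.fit(X, y)

# Importances are normalized to sum to 1; sort them in descending order.
ranking = pd.Series(clf.feature_importances_, index=feature_names)
ranking = ranking.sort_values(ascending=False)
print(ranking)
```

The importances from a classifier would generally differ from those reported in this notebook, so the numbers shown in the outputs here apply only to the regressor that was actually fitted.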


### Important variables

In [78]:
importancias = arbol.feature_importances_

In [79]:
df_importancia = pd.DataFrame(arbol.feature_importances_, index=features, columns=["Importancia"])

df_importancia.sort_values(by=df_importancia.columns[0], ascending=False, inplace=True)  # sort the DataFrame in descending order

df_importancia.head(10)

| | Importancia |
|---|---|
| Elevation | 0.350242 |
| Horizontal_Distance_To_Fire_Points | 0.1539 |
| Horizontal_Distance_To_Roadways | 0.124187 |
| Vertical_Distance_To_Hydrology | 0.070369 |
| Horizontal_Distance_To_Hydrology | 0.067988 |
| Aspect | 0.054868 |
| Hillshade_9am | 0.049818 |
| Hillshade_Noon | 0.047782 |
| Hillshade_3pm | 0.044413 |
| Slope | 0.036433 |


In [80]:
df_importancia["imp_acum"] = df_importancia["Importancia"].cumsum()
df_importancia

| | Importancia | imp_acum |
|---|---|---|
| Elevation | 0.350242 | 0.350242 |
| Horizontal_Distance_To_Fire_Points | 0.1539 | 0.504142 |
| Horizontal_Distance_To_Roadways | 0.124187 | 0.628329 |
| Vertical_Distance_To_Hydrology | 0.070369 | 0.698698 |
| Horizontal_Distance_To_Hydrology | 0.067988 | 0.766686 |
| Aspect | 0.054868 | 0.821553 |
| Hillshade_9am | 0.049818 | 0.871372 |
| Hillshade_Noon | 0.047782 | 0.919154 |
| Hillshade_3pm | 0.044413 | 0.963567 |
| Slope | 0.036433 | 1.0 |


In [81]:
df_importancia.loc[df_importancia['imp_acum']<=0.95]

| | Importancia | imp_acum |
|---|---|---|
| Elevation | 0.350242 | 0.350242 |
| Horizontal_Distance_To_Fire_Points | 0.1539 | 0.504142 |
| Horizontal_Distance_To_Roadways | 0.124187 | 0.628329 |
| Vertical_Distance_To_Hydrology | 0.070369 | 0.698698 |
| Horizontal_Distance_To_Hydrology | 0.067988 | 0.766686 |
| Aspect | 0.054868 | 0.821553 |
| Hillshade_9am | 0.049818 | 0.871372 |
| Hillshade_Noon | 0.047782 | 0.919154 |
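
The filter `imp_acum <= 0.95` stops at a cumulative importance of 0.919154, just short of 95%. Whether the variable that crosses the threshold should also be kept is a judgment call. The sketch below rebuilds the importances from the values reported above and compares both readings (the name `inclusive` and the `shift` approach are illustrative choices, not part of the original notebook):

```python
import pandas as pd

# Importances copied from the table above, already sorted in descending order.
imp = pd.Series({
    "Elevation": 0.350242,
    "Horizontal_Distance_To_Fire_Points": 0.1539,
    "Horizontal_Distance_To_Roadways": 0.124187,
    "Vertical_Distance_To_Hydrology": 0.070369,
    "Horizontal_Distance_To_Hydrology": 0.067988,
    "Aspect": 0.054868,
    "Hillshade_9am": 0.049818,
    "Hillshade_Noon": 0.047782,
    "Hillshade_3pm": 0.044413,
    "Slope": 0.036433,
})
cum = imp.cumsum()

# Strict reading: keep variables whose cumulative importance is at most 0.95.
strict = cum[cum <= 0.95].index.tolist()

# Inclusive reading: keep a variable while the cumulative importance
# *before* it is still below 0.95, so the crossing variable is included.
inclusive = cum[cum.shift(fill_value=0.0) < 0.95].index.tolist()

print(len(strict), len(inclusive))  # 8 9
```

With these values the inclusive reading would also retain Hillshade_3pm, bringing the covered importance to 0.963567.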


In [82]:
variables = df_importancia.loc[df_importancia['imp_acum']>0.95].index
variables = list(variables)
print('Unimportant variables: ',variables)
# The +1 counts the target column, which is kept alongside the selected features
print('Total variables after removal: ',len(features) - len(variables) + 1)

Unimportant variables:  ['Hillshade_3pm', 'Slope', 'Wilderness_Area', 'Soil_Type']
Total variables after removal:  9


In [84]:
df_bosque_f2 = df_f1.drop(labels=variables, axis='columns')
df_bosque_f2.head()

| | Elevation | Aspect | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Horizontal_Distance_To_Fire_Points | Cover_Type |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 1 | 2596.0 | 51.0 | 258.0 | 0.0 | 510.0 | 221.0 | 232.0 | 6279.0 | 1 |
| 2 | 2590.0 | 56.0 | 212.0 | -6.0 | 390.0 | 220.0 | 235.0 | 6225.0 | 1 |
| 3 | 2804.0 | 139.0 | 268.0 | 65.0 | 3180.0 | 234.0 | 238.0 | 6121.0 | 2 |
| 4 | 2785.0 | 155.0 | 242.0 | 118.0 | 3090.0 | 238.0 | 238.0 | 6211.0 | 2 |
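
Row 0 of the table above contains only zeros. That is consistent with, though not proof of, a non-numeric first row that `pd.to_numeric(..., errors='coerce')` turned into NaN and `fillna(0)` then replaced with 0. A minimal sketch with a made-up column shows the mechanism:

```python
import pandas as pd

# Hypothetical column: a text value followed by numeric strings.
col = pd.Series(["Elevation", "2596", "2590"])

# errors="coerce" converts anything that cannot be parsed into NaN.
numeric = pd.to_numeric(col, errors="coerce")

# fillna(0) then silently replaces those NaN values with 0.
filled = numeric.fillna(0)

print(numeric.isna().tolist())  # [True, False, False]
print(filled.tolist())          # [0.0, 2596.0, 2590.0]
```

Counting the NaN values right after the coercion step, before filling them, would be one way to detect rows like this.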
