# Basic classification models

In [1]:
import pandas as pd

url_train = 'https://practicum-content.s3.us-west-1.amazonaws.com/datasets/train_data_us.csv'
df = pd.read_csv(url_train)

print(df.shape)
df.describe()

(6495, 14)


Unnamed: 0,last_price,total_area,bedrooms,ceiling_height,floors_total,living_area,floor,bike_parking,studio,open_plan,kitchen_area,balcony,airports_nearest,cityCenters_nearest
count,6495.0,6495.0,6495.0,6495.0,6495.0,6495.0,6495.0,6495.0,6495.0,6495.0,6495.0,6495.0,6495.0,6495.0
mean,161005.7,65.588209,2.239569,2.780531,10.855427,38.019549,5.922248,0.002309,0.0,0.0,11.057521,0.651578,27996.794303,11460.108699
std,235172.9,39.630351,1.163771,0.687026,6.053443,25.101307,4.635593,0.048005,0.0,0.0,6.575596,1.009999,11581.935337,4776.693612
min,243.8,17.0,1.0,1.0,1.0,3.0,1.0,0.0,0.0,0.0,2.0,0.0,0.0,181.0
25%,87000.0,42.5,1.0,2.55,5.0,20.0,3.0,0.0,0.0,0.0,7.1,0.0,18446.0,8180.0
50%,113000.0,56.2,2.0,2.7,9.0,32.0,5.0,0.0,0.0,0.0,9.5,0.0,26402.0,12255.0
75%,166000.0,75.15,3.0,2.85,15.5,45.5,8.0,0.0,0.0,0.0,12.3,1.0,36421.0,14881.0
max,8400000.0,900.0,16.0,27.5,52.0,409.7,29.0,1.0,0.0,0.0,112.0,5.0,54723.0,29343.0


    last_price: closing price (in dollars)
    total_area: apartment area in square meters (m²)
    bedrooms: number of bedrooms
    ceiling_height: ceiling height (m)
    floors_total: total number of floors in the building
    living_area: living area (m²)
    floor: floor the apartment is on
    bike_parking: bike parking in the building (boolean)
    studio: the property is a studio (boolean)
    open_plan: open-plan layout (boolean)
    kitchen_area: kitchen area (m²)
    balcony: number of balconies
    airports_nearest: distance to the nearest airport in meters (m)
    cityCenters_nearest: distance to the city center in meters (m)

Convert the original task into a classification task. Create a new feature called price_class: set it to 1 for prices above $113,000 and to 0 for prices less than or equal to $113,000. Print the first five rows of the table (that step is already included in the precode).

In [2]:
df.loc[df['last_price'] > 113000, 'price_class'] = 1
df.loc[df['last_price'] <= 113000, 'price_class'] = 0
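The two .loc assignments above first create price_class with missing values and then fill it in, so the column typically ends up stored as floats (1.0 and 0.0). As a minimal sketch of an alternative, assuming a small made-up table (the name toy and its three prices are hypothetical, not rows from the dataset), the same rule can be written as a single vectorized comparison that yields integers:

```python
import pandas as pd

# Hypothetical toy table standing in for df; prices chosen around the threshold
toy = pd.DataFrame({'last_price': [87000.0, 113000.0, 166000.0]})

# The comparison gives True/False per row; astype(int) maps them to 1/0
toy['price_class'] = (toy['last_price'] > 113000).astype(int)

print(toy)
```

A price of exactly 113000 falls into class 0, which matches the "less than or equal" rule in the task statement.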

# Splitting the dataset

In [3]:
from sklearn.model_selection import train_test_split

df_train, df_valid = train_test_split(df, test_size=0.25, random_state=12345)

In [4]:
features_train = df_train.drop(['last_price', 'price_class'], axis=1)
target_train = df_train['price_class']

features_valid = df_valid.drop(['last_price', 'price_class'], axis=1)
target_valid = df_valid['price_class']

print(features_train.shape)
print(target_train.shape)
print(features_valid.shape)
print(target_valid.shape)

(4871, 13)
(4871,)
(1624, 13)
(1624,)
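The shapes printed above can be checked by hand. As far as I understand scikit-learn's behavior, when only test_size is given as a fraction, the validation size is rounded up and the training set receives the remaining rows. A short arithmetic sketch under that assumption, using the 6495 rows reported by df.shape:

```python
import math

n_rows = 6495      # number of rows reported by df.shape
test_size = 0.25   # fraction passed to train_test_split

# Assumption: the validation size is the ceiling of test_size * n_rows
n_valid = math.ceil(test_size * n_rows)
n_train = n_rows - n_valid

print(n_train, n_valid)
```

The 13 columns in the feature tables come from the 14 original columns plus price_class, minus the two dropped columns (last_price and price_class).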


# DecisionTreeClassifier

In [5]:
from sklearn.tree import DecisionTreeClassifier

model_DTC = DecisionTreeClassifier(random_state=12345, max_depth=5)

model_DTC.fit(features_train, target_train)

# Random forest

In [6]:
from sklearn.ensemble import RandomForestClassifier

model_RFC = RandomForestClassifier(random_state=54321, n_estimators=5)

model_RFC.fit(features_train, target_train)

In [7]:
# Manual loop to select the best number of estimators

best_score = 0
best_est = 0
for est in range(1, 11):
    model = RandomForestClassifier(random_state=54321, n_estimators=est)
    model.fit(features_train, target_train)
    score = model.score(features_valid, target_valid)
    if score > best_score:
        best_score = score
        best_est = est

print("Accuracy of the best model on the validation set (n_estimators = {}): {}".format(best_est, best_score))

Accuracy of the best model on the validation set (n_estimators = 9): 0.8879310344827587


# Logistic regression

In [8]:
from sklearn.linear_model import LogisticRegression

model_LR = LogisticRegression(random_state=54321, solver='liblinear')
model_LR.fit(features_train, target_train)

# Model quality

In [9]:
# VERSION 1: With functions

target_predictions = model_DTC.predict(features_train)

def error_count(answers, predictions):
    count = 0
    for i in range(len(answers)):
        if answers[i] != predictions[i]:
            count += 1
    return count

def accuracy(answers, predictions):
    errors = error_count(answers, predictions)
    return (len(answers) - errors) / len(answers)

print('Errors:', error_count(target_train.values, target_predictions))
print('Accuracy:', accuracy(target_train.values, target_predictions))

Errors: 507
Accuracy: 0.8959145965920755
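The two numbers above are consistent with each other: accuracy is the share of correct answers, so it can be recomputed from the error count and the size of the training set (4871 rows, as printed in the split step). A quick arithmetic check:

```python
n_rows = 4871    # rows in the training set
n_errors = 507   # mismatches reported by error_count

# Accuracy = correct answers / all answers
accuracy_value = (n_rows - n_errors) / n_rows

print(accuracy_value)
```

Note that this figure is measured on the training set, the same data the tree was fitted on.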


Metrics to review: precision, recall (sensitivity), and accuracy.
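To make the difference between these three metrics concrete, here is a minimal sketch in the same style as the hand-written functions above. The label lists are made up for illustration, and confusion_counts is a hypothetical helper, not part of any library. Precision is the share of predicted positives that are truly positive, recall is the share of true positives that the model found, and accuracy is the share of all answers that are correct.

```python
def confusion_counts(answers, predictions):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = fp = fn = tn = 0
    for answer, prediction in zip(answers, predictions):
        if prediction == 1 and answer == 1:
            tp += 1
        elif prediction == 1 and answer == 0:
            fp += 1
        elif prediction == 0 and answer == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Hypothetical toy labels, not taken from the housing data
answers = [1, 1, 1, 0, 0, 0, 1, 0]
predictions = [1, 0, 1, 0, 1, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(answers, predictions)

precision = tp / (tp + fp)            # correct positives among predicted positives
recall = tp / (tp + fn)               # correct positives among actual positives
accuracy_value = (tp + tn) / len(answers)

print(precision, recall, accuracy_value)
```

On this toy data the three values differ, which shows that a single accuracy figure can hide how the errors are distributed between the two classes.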

In [10]:
# VERSION 2: With a library

from sklearn.metrics import accuracy_score

target_predictions = model_DTC.predict(features_train)

# Use a distinct name so the accuracy() function defined above is not overwritten
train_accuracy = accuracy_score(target_train, target_predictions)
print(train_accuracy)

0.8959145965920755


accuracy_score is a standalone function from sklearn.metrics that directly compares the true labels (target_train) with predicted labels (target_predictions). It requires you to first generate predictions externally (e.g., using model.predict(features_train)) and then calculate the accuracy. It is more flexible because you can compute accuracy for any set of true and predicted labels, including predictions from different models or manual predictions.
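One way to see this flexibility is to score predictions that do not come from any model. The sketch below uses made-up labels (true_labels and constant_predictions are hypothetical names) and evaluates a constant "always predict 1" baseline:

```python
from sklearn.metrics import accuracy_score

# Hypothetical toy labels, not taken from the housing data
true_labels = [1, 0, 1, 1, 0, 1]

# "Manual" predictions: always predict class 1
constant_predictions = [1] * len(true_labels)

baseline_accuracy = accuracy_score(true_labels, constant_predictions)
print(baseline_accuracy)
```

On the real data such a baseline would probably score close to 0.5, because the 113000 threshold is the median of last_price in the describe() output, so it is a useful reference point when judging the models' accuracy.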

Explore this library further.

In [11]:
# VERSION 3: With the model's score method

score = model_DTC.score(features_train, target_train) 
print(score)

0.8959145965920755


model.score is a method of the trained model object that internally makes predictions on the input features (features_train) and compares those predictions to the true labels (target_train) to compute the accuracy. It combines prediction and accuracy calculation in one call and is less flexible but more convenient for quick evaluation directly on the model.
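The equivalence between the two approaches can be checked directly. This is a minimal sketch on synthetic data generated with make_classification, which is an assumption made to keep the example self-contained; it does not use the housing dataset or the models trained above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the housing dataset)
features, target = make_classification(n_samples=200, n_features=5, random_state=12345)

model = DecisionTreeClassifier(random_state=12345, max_depth=3)
model.fit(features, target)

# One call: predict internally and compare with the true labels
score_from_method = model.score(features, target)

# Two steps: predict explicitly, then compare
score_from_function = accuracy_score(target, model.predict(features))

print(score_from_method, score_from_function)
```

Both values should match, just as the three versions above all report 0.8959145965920755 on the training set.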