# Notebook 2 – Preprocessing and Feature Engineering

**Objective:** Prepare the Airbnb dataset for model training.

This notebook covers:
- Handling missing values
- Encoding categorical variables
- Creating new variables (feature engineering)
- Saving the cleaned dataset


In [1]:
import pandas as pd
import numpy as np

# Load the raw or cleaned dataset
df = pd.read_csv("airbnb_train.csv")  # or "airbnb_train_clean.csv"
print("Shape:", df.shape)
df.head()


Shape: (22234, 28)


Unnamed: 0,id,log_price,property_type,room_type,amenities,accommodates,bathrooms,bed_type,cancellation_policy,cleaning_fee,...,last_review,latitude,longitude,name,neighbourhood,number_of_reviews,review_scores_rating,zipcode,bedrooms,beds
0,5708593,4.317488,House,Private room,"{TV,""Wireless Internet"",Kitchen,""Free parking ...",3,1.0,Real Bed,flexible,False,...,,33.782712,-118.13441,Island style Spa Studio,Long Beach,0,,90804,0.0,2.0
1,14483613,4.007333,House,Private room,"{""Wireless Internet"",""Air conditioning"",Kitche...",4,2.0,Real Bed,strict,False,...,2017-09-17,40.705468,-73.909439,"Beautiful and Simple Room W/2 Beds, 25 Mins to...",Ridgewood,38,86.0,11385,1.0,2.0
2,10412649,7.090077,Apartment,Entire home/apt,"{TV,""Wireless Internet"",""Air conditioning"",Kit...",6,2.0,Real Bed,flexible,False,...,,38.917537,-77.031651,2br/2ba luxury condo perfect for infant / toddler,U Street Corridor,0,,20009,2.0,2.0
3,17954362,3.555348,House,Private room,"{TV,""Cable TV"",Internet,""Wireless Internet"",""A...",1,1.0,Real Bed,flexible,True,...,2017-09-29,40.736001,-73.924248,Manhattan view from Queens. Lovely single room .,Sunnyside,19,96.0,11104,1.0,1.0
4,9969781,5.480639,House,Entire home/apt,"{TV,""Cable TV"",Internet,""Wireless Internet"",Ki...",4,1.0,Real Bed,moderate,True,...,2017-08-28,37.744896,-122.430665,Zen Captured Noe Valley House,Noe Valley,15,96.0,94131,2.0,2.0


## Handling missing values


In [2]:
# Example of a simple strategy (adapt as needed):
df['review_scores_rating'] = df['review_scores_rating'].fillna(df['review_scores_rating'].mean())
df['bedrooms'] = df['bedrooms'].fillna(df['bedrooms'].median())
df['beds'] = df['beds'].fillna(df['beds'].median())
df['bathrooms'] = df['bathrooms'].fillna(df['bathrooms'].median())

# Boolean columns
df['host_has_profile_pic'] = df['host_has_profile_pic'].fillna(False)
df['host_identity_verified'] = df['host_identity_verified'].fillna(False)
# cleaning_fee is a True/False flag in this dataset, so fill with the boolean False
# (the string 'False' would mix types in the column)
df['cleaning_fee'] = df['cleaning_fee'].fillna(False)

# Text columns
df['description'] = df['description'].fillna('')
df['amenities'] = df['amenities'].fillna('')

# Ten columns with the most remaining missing values
df.isnull().sum().sort_values(ascending=False).head(10)


host_response_rate    5475
first_review          4725
last_review           4716
neighbourhood         2086
zipcode                303
host_since              56
amenities                0
accommodates             0
id                       0
log_price                0
dtype: int64
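
Filling `review_scores_rating` with the mean hides which listings had no rating at all; in the preview above, the rows with `number_of_reviews` equal to 0 are exactly the ones with a missing score. One common option is to keep a 0/1 marker next to each imputed column. The sketch below is illustrative only: the helper name `impute_with_flag` and the toy values are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def impute_with_flag(frame, columns):
    """Median-impute numeric columns and record where values were missing."""
    out = frame.copy()
    for col in columns:
        # Keep a 0/1 marker so a model can still see that the value was absent.
        out[f"{col}_missing"] = out[col].isna().astype(int)
        out[col] = out[col].fillna(out[col].median())
    return out


# Toy data standing in for the real listings (values are made up).
toy = pd.DataFrame({
    "review_scores_rating": [86.0, np.nan, 96.0, 96.0],
    "bedrooms": [1.0, 2.0, np.nan, 2.0],
})
imputed = impute_with_flag(toy, ["review_scores_rating", "bedrooms"])
print(imputed)
```

If a separate test set is prepared later, the medians would normally be computed on the training data and reused, rather than recomputed on the test data.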

## Encoding categorical variables


In [3]:
cat_cols = ['room_type', 'property_type', 'bed_type', 'cancellation_policy', 'city']
# One-hot encode; drop_first=True drops one level per variable to avoid redundant columns
df_encoded = pd.get_dummies(df, columns=cat_cols, drop_first=True)

df_encoded.shape


(22234, 68)
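
`pd.get_dummies` builds its columns from whatever categories appear in the frame it is given, so a test file encoded separately can end up with a different set of columns. A minimal sketch of one way to line them up, assuming a separate test frame exists (the helper name `align_to_train` and the toy frames are hypothetical):

```python
import pandas as pd


def align_to_train(test_frame, train_columns, cat_cols):
    """One-hot encode test_frame and force it onto the training columns."""
    # No drop_first here: which level gets dropped is decided by the training columns.
    encoded = pd.get_dummies(test_frame, columns=cat_cols)
    # Columns missing in test are filled with 0; columns unknown to train are discarded.
    return encoded.reindex(columns=train_columns, fill_value=0)


# Toy frames (made-up rows); the test frame lacks the 'Shared room' category.
train = pd.DataFrame({"room_type": ["Private room", "Entire home/apt", "Shared room"]})
test = pd.DataFrame({"room_type": ["Entire home/apt", "Private room"]})

train_enc = pd.get_dummies(train, columns=["room_type"], drop_first=True)
test_aligned = align_to_train(test, train_enc.columns, ["room_type"])
print(list(test_aligned.columns))
```

After alignment both frames expose the same columns in the same order, which is what a fitted model expects at prediction time.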

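## Feature engineering


In [None]:
# Number of items in 'amenities' (stored as '{TV,Kitchen,...}')
# Strip the braces and skip empty pieces so that '' and '{}' count as 0, not 1
df_encoded['n_amenities'] = df['amenities'].apply(
    lambda x: sum(1 for item in x.strip('{}').split(',') if item.strip())
)

# Length of the description in characters
df_encoded['desc_length'] = df['description'].str.len()

# Host tenure in days, measured from the day the notebook is run
# (the value changes between runs; missing or unparseable dates become 0)
df['host_since'] = pd.to_datetime(df['host_since'], errors='coerce')
df_encoded['host_duration'] = (pd.to_datetime('today') - df['host_since']).dt.days.fillna(0)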
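
The `amenities` strings shown in the preview wrap the list in braces and put multi-word items in double quotes, so a plain `split(',')` reports one item for an empty value and can miscount quoted entries. The following sketch shows one possible parser built on the standard `csv` module; the function name `parse_amenities` and the sample strings are assumptions made for illustration.

```python
import csv

import pandas as pd


def parse_amenities(raw):
    """Split a '{TV,"Wireless Internet",Kitchen}' style string into a list."""
    inner = raw.strip().strip("{}")
    if not inner:
        return []
    # csv.reader honours the double quotes around multi-word amenities.
    return [item.strip() for item in next(csv.reader([inner])) if item.strip()]


# Sample strings in the same shape as the preview, plus two empty cases.
toy = pd.Series(['{TV,"Wireless Internet",Kitchen}', "{}", ""])

naive_counts = toy.apply(lambda raw: len(raw.split(","))).tolist()
parsed_counts = toy.apply(lambda raw: len(parse_amenities(raw))).tolist()

print(naive_counts)   # [3, 1, 1]
print(parsed_counts)  # [3, 0, 0]
```

The parsed lists could also feed simple presence flags (for example, whether `Kitchen` appears), should individual amenities turn out to matter for price.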



## Dropping unneeded columns


In [6]:
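# 'Unnamed: 0' is the row index that was saved with the raw CSV; it carries no information
df_encoded = df_encoded.drop(columns=['Unnamed: 0', 'id', 'name', 'description', 'amenities', 'first_review', 'last_review', 'zipcode', 'host_since'])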
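
The null-count output earlier lists `host_response_rate` and `neighbourhood` among the columns with missing values, and neither is in the drop list, so they would still be present at this point. A quick check before saving can make that visible. This is a sketch on a toy frame; `report_unready_columns` and `some_numeric_feature` are hypothetical names and the values are made up.

```python
import numpy as np
import pandas as pd


def report_unready_columns(frame):
    """Return columns that are non-numeric or still contain missing values."""
    non_numeric = frame.select_dtypes(exclude=["number", "bool"]).columns.tolist()
    with_missing = frame.columns[frame.isna().any()].tolist()
    return {"non_numeric": non_numeric, "with_missing": with_missing}


# Toy frame: one text column with a gap, one numeric column with a gap.
toy = pd.DataFrame({
    "log_price": [4.3, 4.0, 7.1],
    "neighbourhood": ["Long Beach", "Ridgewood", None],
    "some_numeric_feature": [np.nan, 86.0, 96.0],
    "room_type_Private room": [True, True, False],
})
report = report_unready_columns(toy)
print(report)
```

Running the same helper on `df_encoded` would show which columns still need encoding, imputation, or removal before most scikit-learn estimators will accept the data.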


In [7]:
df_encoded.to_csv("train_ready.csv", index=False)
print("✅ Prepared data saved to 'train_ready.csv'")


✅ Prepared data saved to 'train_ready.csv'
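
Passing `index=False` keeps the row index out of the file; without it, reloading the CSV produces an extra unnamed column, which is presumably where the `Unnamed: 0` column in the raw data came from. A small round-trip check, sketched here on a toy frame written to a temporary directory (the values are made up), confirms that the saved file reloads with the same shape and columns:

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for df_encoded.
toy = pd.DataFrame({
    "log_price": [4.3, 4.0],
    "room_type_Private room": [True, False],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_ready.csv")
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)
```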
