# Tunisia Real Estate CSV - Second Cleanup

This cleanup pass reviews the price and surface-area values of the apartment listings and removes rows with implausible values.

### Step 1: Load the data and inspect it

In [1]:
import pandas as pd

# Load the CSV file
file_path = '../../data/raw/tunisia-real-estate.csv'  # Update this path with the actual file location
data = pd.read_csv(file_path)

# Display the first few rows to understand the structure
data.head()

Unnamed: 0.1,Unnamed: 0,Governorate,Delegation,Locality,Nature,Type of Real Estate,Surface,Price,Inserted On
0,0,tunis,sidi el bechir,Sidi El Bechir,Sale,2-room apartment,70.0,120000.0,06/10/2023
1,1,tunis,marsa,La Marsa,Sale,3-room apartment,130.0,277000.0,23/10/2023
2,2,ariana,ariana ville,Ariana Ville,Sale,2-room apartment,111.0,195000.0,09/11/2023
3,3,tunis,ettahrir,Ettahrir,Sale,5-room apartment and more,150.0,268000.0,13/11/2023
4,4,tunis,marsa,La Marsa,Sale,3-room apartment,134.0,340000.0,08/11/2023
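Before filtering, it can help to check the column types and the value ranges of `Surface` and `Price`. The following is a minimal sketch on a toy frame; `sample_csv`, `sample`, and `leftover_index_cols` are illustrative names, and the toy column set is an assumption rather than the real file's full schema. Columns named `Unnamed: ...` typically appear when an earlier `to_csv` call wrote the index to the file.

```python
import io

import pandas as pd

# Toy stand-in for the real CSV (a few values taken from the preview above).
sample_csv = io.StringIO(
    "Unnamed: 0,Governorate,Surface,Price\n"
    "0,tunis,70.0,120000.0\n"
    "1,tunis,130.0,277000.0\n"
    "2,ariana,111.0,195000.0\n"
)
sample = pd.read_csv(sample_csv)

# Columns whose name starts with "Unnamed" are usually a previously saved index.
leftover_index_cols = [c for c in sample.columns if c.startswith("Unnamed")]

# Column types and summary statistics for the two columns this cleanup reviews.
print(sample.dtypes)
print(sample[["Surface", "Price"]].describe())
print(leftover_index_cols)
```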


### Step 2: Remove rows with surface value less than 20m²

In [2]:
# Convert 'Surface' to numeric
data['Surface'] = pd.to_numeric(data['Surface'], errors='coerce')

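# Filter out rows with surface < 20 (values that could not be parsed became NaN above and are dropped too)
# .copy() makes filtered_data an independent DataFrame, so later column assignments cannot trigger SettingWithCopyWarning
filtered_data = data[data['Surface'] >= 20].copy()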

# Check the shape of the filtered data
filtered_data.shape

(2424, 9)
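Because `errors='coerce'` turns unparseable values into `NaN`, and `NaN >= 20` evaluates to `False`, the filter removes two kinds of rows at once. A small sketch on made-up values (the `toy` frame and counter names are hypothetical, not taken from the dataset) shows how to count each kind separately:

```python
import pandas as pd

# Made-up surface values: two valid, one too small, one unparseable.
toy = pd.DataFrame({"Surface": ["70", "15", "n/a", "130"]})
toy["Surface"] = pd.to_numeric(toy["Surface"], errors="coerce")

# Rows that became NaN during conversion.
unparseable = int(toy["Surface"].isna().sum())

# Rows with a valid number below the 20 m² threshold (NaN compares as False).
too_small = int((toy["Surface"] < 20).sum())

# Same filter as in the cell above: drops both kinds of rows.
kept = toy[toy["Surface"] >= 20]

print(f"unparseable={unparseable}, too_small={too_small}, kept={len(kept)}")
```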

### Step 3: Remove rows with price value less than 10,000 TND

In [3]:
# Convert 'Price' to numeric; values that cannot be parsed become NaN
filtered_data.loc[:, 'Price'] = pd.to_numeric(filtered_data['Price'], errors='coerce')

# Filter out rows with price < 10,000 TND
filtered_data = filtered_data[filtered_data['Price'] >= 10000]

# Check the shape of the filtered data
print(filtered_data.shape)

(2416, 9)
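Steps 2 and 3 follow the same pattern: convert a column to numeric, then keep rows at or above a minimum. One way to express that once is a small helper; `drop_below` is a hypothetical name introduced here for illustration, not part of the notebook. It works on copies, so the input frame is never modified.

```python
import pandas as pd


def drop_below(df, column, minimum):
    """Return a new DataFrame keeping only rows where `column` >= `minimum`.

    Values that cannot be parsed as numbers become NaN and are dropped as well.
    """
    out = df.copy()
    out[column] = pd.to_numeric(out[column], errors="coerce")
    return out[out[column] >= minimum].copy()


# Made-up rows: one valid, one with a tiny surface, one with a tiny price.
toy = pd.DataFrame({"Surface": [70, 15, 130], "Price": [120000, 90000, 5000]})

# Apply the surface threshold first, then the price threshold.
cleaned = drop_below(drop_below(toy, "Surface", 20), "Price", 10000)
print(cleaned)
```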


### Step 4: Display listings priced above 1,000,000 TND

In [4]:
# Display rows with price > 1,000,000 TND
high_price_houses = filtered_data[filtered_data['Price'] > 1000000]
high_price_houses

Unnamed: 0.1,Unnamed: 0,Governorate,Delegation,Locality,Nature,Type of Real Estate,Surface,Price,Inserted On
43,43,tunis,ain zaghouan,Ain Zaghouan,Sale,3-room apartment,267.0,1250000.0,28/02/2023
44,44,tunis,menzah,El Menzah,Sale,5-room apartment and more,500.0,1800000.0,19/10/2023
66,66,tunis,ouerdia,El Ouerdia,Sale,3-room apartment,75.0,9500000.0,17/11/2023
99,99,ariana,soukra,La Soukra,Sale,3-room apartment,320.0,1250000.0,30/11/2023
131,131,ariana,soukra,La Soukra,Sale,1-room apartment,350.0,1450000.0,23/11/2023
...,...,...,...,...,...,...,...,...,...
2290,2290,tunis,marsa,La Marsa,Sale,5-room apartment and more,470.0,2200000.0,22/12/2023
2364,2364,tunis,marsa,La Marsa,Sale,1-room apartment,280.0,1500000.0,12/12/2023
2398,2398,tunis,marsa,La Marsa,Sale,3-room apartment,440.0,1200000.0,29/12/2023
2420,2420,ben arous,ezzahra,Ezzahra,Sale,2-room apartment,250.0,6500000.0,22/12/2023


### Step 5: Remove rows with both surface > 500m² and price > 2,000,000 TND

Rows whose surface exceeds 500m² and whose price also exceeds 2,000,000 TND are removed because they are more likely to be villas or other non-apartment properties. This project focuses exclusively on apartments, so such rows are outliers that are irrelevant to the analysis. Rows that exceed only one of the two thresholds are kept.

In [5]:
# Keep rows with surface <= 500m² or price <= 2,000,000 TND, i.e. drop only rows that exceed both thresholds
final_data = filtered_data[(filtered_data['Surface'] <= 500) | (filtered_data['Price'] <= 2000000)]

# Check the shape of the final dataset
final_data.shape

(2410, 9)
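The choice between `|` and `&` in the mask decides which rows survive. The sketch below uses made-up values (the `toy` frame and the variable names are illustrative) to contrast the condition used in this step with a stricter alternative that would drop a row as soon as it exceeds either threshold.

```python
import pandas as pd

# Made-up rows covering all four combinations of the two thresholds.
toy = pd.DataFrame(
    {
        "Surface": [75.0, 600.0, 700.0, 120.0],
        "Price": [9_500_000.0, 900_000.0, 3_000_000.0, 250_000.0],
    }
)

small_enough = toy["Surface"] <= 500
cheap_enough = toy["Price"] <= 2_000_000

# Condition used in this step: a row is dropped only if it exceeds BOTH thresholds.
keep_unless_both = toy[small_enough | cheap_enough]

# Stricter alternative: a row is dropped if it exceeds EITHER threshold.
keep_only_if_neither = toy[small_enough & cheap_enough]

print(keep_unless_both.index.tolist())
print(keep_only_if_neither.index.tolist())
```

With the `|` form, a small apartment with a very high price, or a very large but cheap property, stays in the dataset; only the `&` form would remove those as well.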

### Step 6: Save the cleaned dataset

In [6]:
# Save the cleaned dataset, overwriting the original file
final_data.to_csv(file_path, index=False)

print(f"Cleaned data has been saved and the original file has been overwritten: {file_path}")

Cleaned data has been saved and the original file has been overwritten: ../../data/raw/tunisia-real-estate.csv
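Overwriting the file under `raw/` means the pre-cleanup data cannot be recovered from that path afterwards. If that matters, one option is to write the result to a separate location instead. The sketch below is only an illustration: the `processed` directory and the output file name are assumptions, and a temporary directory stands in for the project's data folder so the example runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in for final_data.
toy = pd.DataFrame({"Surface": [70.0, 130.0], "Price": [120000.0, 277000.0]})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: raw/ stays untouched, cleaned output goes to processed/.
    processed_dir = Path(tmp) / "processed"
    processed_dir.mkdir(parents=True, exist_ok=True)
    out_path = processed_dir / "tunisia-real-estate-clean.csv"

    # index=False avoids writing the index as an extra unnamed column.
    toy.to_csv(out_path, index=False)

    # Read the file back to confirm the round trip preserved the columns.
    reloaded = pd.read_csv(out_path)

print(reloaded.shape)
```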
