# 1. Installing requirements

Run the cell below to set up the environment and install the packages needed for this work.

In [None]:
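# create a virtual environment and install the requirements into it
# (each `!` command runs in its own subshell, so `source .venv/bin/activate` would not persist)
!python3 -m venv .venv
!.venv/bin/python -m pip install -r requirements.txt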

# 2. Framing the problem

## 2.1 Goal

We want to predict Airbnb housing prices based on attributes of the thousands of listings gathered in a large dataset.
- Features include location, review score, number of bedrooms, etc.


## 2.2 Algorithm

- Learning 
  - **Supervised learning**
    - We're predicting housing prices, a known target variable within our dataset.
  - **Batch learning**
    - We're training on the entire Kaggle dataset at once! The dataset is split into months, but since we aren't doing time-based forecasting, we'll analyze it as a whole.
  - **Model based**
    - We'll create a predictive model to forecast housing prices.
- Model task: **Linear regression**
  - This is a regression problem, as we aim to predict the numerical value of Airbnb housing prices in Brazilian reais (BRL).
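
To make the model task concrete, here is a minimal sketch of an ordinary least squares fit with NumPy. The feature (`bedrooms`) and the prices are made-up values for illustration, not data from the Kaggle dataset, and a real model would use many more features.

```python
import numpy as np

# Hypothetical toy data: one feature (number of bedrooms) and a price in BRL.
bedrooms = np.array([1.0, 2.0, 3.0, 4.0])
price = np.array([150.0, 250.0, 350.0, 450.0])

# Design matrix with a column of ones so the model can learn an intercept.
X = np.column_stack([np.ones_like(bedrooms), bedrooms])

# Ordinary least squares: coefficients minimizing ||X @ coef - price||^2.
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
intercept, slope = coef

# Predict the price of a hypothetical 5-bedroom listing.
predicted = intercept + slope * 5.0
print(f"price ~ {intercept:.1f} + {slope:.1f} * bedrooms, prediction: {predicted:.1f}")
```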

## 2.3 Performance measure for evaluation: **RMSE or MAE**
Choosing the right performance measure, either RMSE or MAE, **depends heavily on the characteristics of our dataset**, particularly the presence of outliers and the data's distribution.
  - RMSE (Root Mean Squared Error) **is more sensitive to outliers** due to the squaring of errors. This makes it suitable for datasets with a balanced, bell-shaped distribution and few outliers.
  - On the other hand, MAE (Mean Absolute Error) is more robust to outliers, as it considers the absolute difference between predictions and actual values. MAE is generally preferred for datasets with a **significant presence of outliers**.
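
The difference in outlier sensitivity can be checked on a small example. The sketch below defines both metrics with NumPy and compares them on made-up prices, once with uniformly small errors and once with a single large miss. The helper names `rmse` and `mae` are illustrative.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square the errors, average, take the root."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Made-up prices for illustration only.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred_clean = [110.0, 190.0, 310.0, 390.0]    # every error is 10
y_pred_outlier = [110.0, 190.0, 310.0, 800.0]  # one error of 400

print("clean:  ", rmse(y_true, y_pred_clean), mae(y_true, y_pred_clean))
print("outlier:", rmse(y_true, y_pred_outlier), mae(y_true, y_pred_outlier))
```

With uniform errors both metrics agree. With one large miss, RMSE grows much faster than MAE because the error is squared before averaging.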

# 3. Loading the dataset

⚠️ I received a DtypeWarning! Some columns have mixed types. We'll need to handle this when we analyze the dataset.

In [3]:
import pandas as pd
import numpy as np

df_raw = pd.read_csv("data/total_data.csv")

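
Two common ways to avoid mixed-type inference in `pd.read_csv` are sketched below: reading the file in a single pass with `low_memory=False`, or declaring the column type with `dtype`. The in-memory CSV and the `zipcode` column are hypothetical stand-ins for `data/total_data.csv`. Which columns actually trigger the warning has to be checked on the real file.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for data/total_data.csv.
csv_text = "id,zipcode\n1,22020\n2,22041-001\n3,\n"

# Option 1: parse the file in one pass so each column gets a single inferred type
# (uses more memory on large files).
df_one_pass = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# Option 2: declare the type of the ambiguous column explicitly.
df_typed = pd.read_csv(io.StringIO(csv_text), dtype={"zipcode": str})

print(df_typed.dtypes)
```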


# 4. EDA and to-do list

## 4.1 **.head()**: first glance

At first glance, this dataset contains a lot of textual information, but we're going to predict a numerical value. Let's keep analyzing!

In [4]:
df_raw.head(2)

| Unnamed: 0.1 | Unnamed: 0 | id | listing_url | scrape_id | last_scraped | name | summary | space | description | experiences_offered | ... | minimum_minimum_nights | maximum_minimum_nights | minimum_maximum_nights | maximum_maximum_nights | minimum_nights_avg_ntm | maximum_nights_avg_ntm | number_of_reviews_ltm | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | 17878 | https://www.airbnb.com/rooms/17878 | 20180820000000.0 | 2018-08-16 | Very Nice 2Br - Copacabana - WiFi | Please note that special rates apply for New Y... | - large balcony which looks out on pedestrian ... | Please note that special rates apply for New Y... | none | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 1.0 | 24480 | https://www.airbnb.com/rooms/24480 | 20180820000000.0 | 2018-08-16 | Nice and cozy near Ipanema Beach | My studio is located in the best of Ipanema. ... | The studio is located at Vinicius de Moraes St... | My studio is located in the best of Ipanema. ... | none | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


## 4.2 **.info()**: value types

We have 72 object-type columns, which is more than half of the dataset's 108 columns. We'll need to take a close look at them to extract useful information. The goal is to filter out the useful ones and encode some of them as numerical values. For this, we have two tasks for the object-type columns: 1) identify interesting data and 2) verify whether it can be encoded.


**"What is encoding? Why encode?"**
- As we're doing a linear regression task, our model needs to be fed numerical data. Text is stored as strings, so we need a way to transform it! In NLP cases, we would tokenize the text, but here we're only looking for **categorical features**. Afterwards, we need to identify whether they're **ordinal** or **nominal**.
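
As a rough illustration of the two cases, the sketch below encodes an ordinal feature (categories with a natural order) as ranked integers and a nominal feature (no natural order) with one-hot encoding. The column names, categories, and the chosen ordering are assumptions made for this toy example, not findings from the dataset.

```python
import pandas as pd

# Hypothetical toy listings; column names and categories are illustrative only.
df = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"],
    "cancellation_policy": ["flexible", "strict", "moderate", "flexible"],
})

# Ordinal feature: the categories have an order, so map them to ranked integers.
policy_order = {"flexible": 0, "moderate": 1, "strict": 2}
df["cancellation_policy_encoded"] = df["cancellation_policy"].map(policy_order)

# Nominal feature: no order, so create one indicator column per category.
room_dummies = pd.get_dummies(df["room_type"], prefix="room_type")

# Numerical frame that could be fed to a regression model.
encoded = pd.concat([df[["cancellation_policy_encoded"]], room_dummies], axis=1)
print(encoded)
```

Mapping a nominal feature to integers would suggest an order that does not exist, which is why one-hot encoding is used for it here.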

In [6]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 784122 entries, 0 to 784121
Columns: 108 entries, Unnamed: 0 to calculated_host_listings_count_shared_rooms
dtypes: float64(36), object(72)
memory usage: 646.1+ MB


## 4.3 **.nunique()** and **.value_counts()**

Let's dig deeper into the object-type data using two pandas methods: **.nunique()** and **.value_counts()**.

- **.nunique()** counts the distinct values in a column, while **.value_counts()** counts the occurrences of each distinct value.
- Both will help us choose the best features to keep among the 72 columns.
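
A small sketch of the difference between the two methods, using a made-up `room_type` column rather than the real dataset:

```python
import pandas as pd

# Hypothetical toy column, not taken from the dataset.
df = pd.DataFrame(
    {"room_type": ["Private room", "Entire home/apt", "Private room", "Private room"]}
)

# nunique(): how many distinct values the column holds.
n_distinct = df["room_type"].nunique()

# value_counts(): how many rows carry each distinct value, most frequent first.
counts = df["room_type"].value_counts()

print(n_distinct)
print(counts)
```

A column with few distinct values that are reasonably spread across rows is a good candidate for encoding. A column with almost as many distinct values as rows (such as free text) is not.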

In [None]:
# Calculate nunique for 'object' columns and convert to DataFrame
nunique_df = pd.DataFrame(df_raw.select_dtypes(include='object').astype(str).nunique()).reset_index()

# Rename columns for clarity
nunique_df.columns = ['Column Name', 'Unique Value Count']

# Filter rows where 'Unique Value Count' is less than 10
filtered_df = nunique_df[nunique_df["Unique Value Count"] < 10]

# Display the filtered DataFrame
print(filtered_df)

# 5. ETC