In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Data Downloading

In [3]:
raw_data = pd.read_csv('./data/US/combined_listings.csv')


# 2. Data Exploration and Preprocessing
Note: In practice, the process of data analysis often involves a combination of both data preprocessing and data exploration, and the order can be somewhat iterative.

## 2.1. Initial Data Exploration
Objective: Get a general sense of the data.

Activities:
1. Load the data.*
2. Conduct preliminary descriptive statistics (mean, median, mode, etc.).*
3. Generate basic visualizations (histograms, scatter plots, etc.).*
4. Look for obvious issues such as missing values, duplicates, or outliers.*
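
The checks in activities 2 and 4 can be sketched roughly as follows. This is only an illustration on a small made-up DataFrame standing in for `raw_data`; the column names `price` and `bedrooms` are assumptions, not taken from the listings file, and the 1.5 × IQR rule is just one common way to flag outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for raw_data; the column names are assumptions.
sample = pd.DataFrame({
    "price": [100.0, 150.0, np.nan, 150.0, 5000.0],
    "bedrooms": [1, 2, 2, 2, 3],
})

# Descriptive statistics (count, mean, std, quartiles) for the numeric columns.
summary = sample.describe()

# Number of missing values per column.
missing_counts = sample.isna().sum()

# Number of rows that exactly repeat an earlier row.
duplicate_count = int(sample.duplicated().sum())

# Flag potential outliers in one column with the 1.5 * IQR rule.
q1, q3 = sample["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (sample["price"] < q1 - 1.5 * iqr) | (sample["price"] > q3 + 1.5 * iqr)

print(summary)
print(missing_counts)
print("duplicate rows:", duplicate_count)
print(sample[outlier_mask])
```

Histograms and scatter plots (activity 3) would normally follow the same pattern, using the `matplotlib` and `seaborn` imports from the first cell.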

## 2.2. Data Preprocessing
Objective: Clean and prepare the data for detailed analysis and modeling.

Activities:
1. Handle missing values (imputation, deletion).*
2. Remove or correct inconsistencies and duplicates.* (Open question: should we also remove outliers?)
3. Normalize or standardize numerical data.*
4. Encode categorical variables. (In our case, this is not needed unless we want to classify prices into intervals.)
5. Scale features. (Open question: is this needed?)
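
A minimal sketch of activities 1 to 3, assuming median imputation and z-score standardization are the chosen techniques (the outline does not fix either choice). The DataFrame and its column names are hypothetical stand-ins for `raw_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for raw_data; the column names are assumptions.
sample = pd.DataFrame({
    "price": [100.0, 150.0, np.nan, 150.0, 200.0],
    "bedrooms": [1.0, 2.0, 2.0, 2.0, 3.0],
})

# 1. Handle missing values: impute with the column median.
#    (Deleting incomplete rows with sample.dropna() is the alternative.)
cleaned = sample.fillna(sample.median(numeric_only=True))

# 2. Remove exact duplicate rows and renumber the index.
cleaned = cleaned.drop_duplicates().reset_index(drop=True)

# 3. Standardize each numeric column to zero mean and unit variance.
standardized = (cleaned - cleaned.mean()) / cleaned.std(ddof=0)

print(cleaned)
print(standardized)
```

When a model is trained later, the median, mean, and standard deviation would usually be computed on the training split only and then reused on validation and test data, so that no information leaks from the held-out rows.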

## 2.3. Detailed Data Exploration
Objective: Gain deeper insights into the cleaned and preprocessed data.

Activities:
1. Perform more detailed statistical analyses.*
2. Create more sophisticated visualizations to uncover hidden patterns.*
3. Conduct correlation analysis to identify relationships between variables.*
4. Formulate and test initial hypotheses based on the cleaned data.*
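
As an illustration of activity 3, the sketch below computes a Pearson correlation matrix and ranks features by the strength of their correlation with the target. The data is synthetic and the column names `area`, `price`, and `unrelated` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: price depends linearly on area plus noise; "unrelated" is pure noise.
area = rng.uniform(30, 200, size=200)
price = 1000 * area + rng.normal(0, 5000, size=200)
unrelated = rng.normal(size=200)
sample = pd.DataFrame({"area": area, "price": price, "unrelated": unrelated})

# Pairwise Pearson correlation between all numeric columns.
corr = sample.corr(method="pearson")

# Rank the features by the absolute value of their correlation with the target.
ranked = corr["price"].drop("price").abs().sort_values(ascending=False)

print(corr)
print(ranked)
```

Passing `corr` to `sns.heatmap` is a common way to inspect the full matrix visually. Pearson correlation only captures linear relationships, so a low value does not rule out a non-linear dependence.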

## 2.4. Iterative Refinement (Optional)
Objective: Refine your understanding and preparation of the data.

Activities:
1. Revisit data preprocessing steps as new insights are gained.
2. Continue exploring the data in greater detail as needed.
3. Adjust preprocessing techniques based on exploration findings (e.g., new outliers detected, better ways to handle missing values).*

# 3. Initial Model Building

Objective: Choose the most appropriate initial model based on data exploration and preprocessing insights.

We may use the Keras Sequential() model.

Activities:
1. Review the distribution of your target variable and main features.*
2. Ensure that the chosen model aligns with the nature of the data (e.g., linear vs. non-linear, categorical vs. numerical).*
3. Start with simple baseline models, such as:*
      1. Linear Regression for regression tasks
      2. Logistic Regression for classification tasks
4. Ensure your data roughly meets the assumptions of the chosen simple model (e.g., linearity for Linear Regression).*
5. Explain why this initial model was chosen.*
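
A baseline of the kind described in activity 3 might look like the sketch below. It assumes scikit-learn is available (it is not imported in the notebook's first cell) and uses synthetic data with a known linear relationship, so the features and coefficients are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic regression problem: y = 3*x0 - 2*x1 + 5 + noise.
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(0, 0.5, size=200)

# Keep 20% of the rows aside to check the fit on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit ordinary least squares on the training rows.
baseline = LinearRegression().fit(X_train, y_train)

# score() returns R^2 on the given data.
r2_test = baseline.score(X_test, y_test)

print("coefficients:", baseline.coef_, "intercept:", baseline.intercept_)
print("test R^2:", r2_test)
```

A baseline like this gives a reference score that any more complex model, such as a neural network, should beat to justify its extra cost.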

## 3.1. Design Initial Neural Network (if applicable):
Determine Network Architecture:
1. Number of Layers: Start with a simple architecture, such as one or two hidden layers.
2. Number of Nodes per Layer: Begin with a small number of nodes, such as 10-50 per hidden layer, depending on the complexity of your data.
3. Activation Functions: Choose activation functions based on the nature of the data and the task:
   1. ReLU (Rectified Linear Unit): Common for hidden layers due to its simplicity and effectiveness.
   2. Sigmoid or Softmax: Use Sigmoid for binary classification outputs and Softmax for multi-class classification outputs.
   3. Linear Activation: Often used in the output layer for regression tasks.
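
The architecture guidelines above could be sketched in Keras as follows, assuming TensorFlow is installed and the task is regression. The helper name `build_regression_mlp`, the layer sizes, and the choice of the Adam optimizer are assumptions made for the example, not decisions recorded in this outline.

```python
import numpy as np
from tensorflow import keras


def build_regression_mlp(n_features, hidden_units=(32, 16)):
    """Build a small fully connected network for a single regression target."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(n_features,)))
    for units in hidden_units:
        # ReLU activation in each hidden layer.
        model.add(keras.layers.Dense(units, activation="relu"))
    # One output unit with linear activation for regression.
    model.add(keras.layers.Dense(1, activation="linear"))
    # Mean squared error as the loss, mean absolute error as an extra metric.
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model


model = build_regression_mlp(n_features=4)
model.summary()

# Untrained forward pass on dummy input, just to confirm the output shape.
predictions = model.predict(np.zeros((2, 4)), verbose=0)
print(predictions.shape)
```

For a classification task, the last layer would instead use a sigmoid (binary) or softmax (multi-class) activation with a matching cross-entropy loss.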

# 4. Model Evaluation

Metrics and Techniques (a.k.a. Methods)

1. Classification Models' Metrics
   1. Accuracy: The proportion of correctly classified instances among the total instances.
   2. Precision: The proportion of true positive instances among the instances predicted as positive.
   3. Recall (Sensitivity or True Positive Rate): The proportion of true positive instances among the actual positive instances.
   4. F1 Score: The harmonic mean of precision and recall, providing a single metric that balances both.
   5. Confusion Matrix: A table showing the true positives, true negatives, false positives, and false negatives.
   6. ROC Curve (Receiver Operating Characteristic Curve): A graphical representation of the true positive rate vs. the false positive rate at various threshold settings.
   7. AUC (Area Under the ROC Curve): A single metric that summarizes the ROC curve, indicating the model's ability to discriminate between classes.
   8. Log Loss (Cross-Entropy Loss): Measures the performance of a classification model where the output is a probability value between 0 and 1.
   9. Precision-Recall Curve: A graphical representation of precision vs. recall at various threshold settings.
   10. Specificity (True Negative Rate): The proportion of true negative instances among the actual negative instances.
2. Regression Models' Metrics
   1. Mean Absolute Error (MAE): The average of the absolute differences between the predicted and actual values.
   2. Mean Squared Error (MSE): The average of the squared differences between the predicted and actual values.
   3. Root Mean Squared Error (RMSE): The square root of the average of the squared differences between the predicted and actual values.
   4. R-Squared (Coefficient of Determination): Indicates the proportion of the variance in the dependent variable that is predictable from the independent variables.
   5. Adjusted R-Squared: Adjusted version of R-Squared that accounts for the number of predictors in the model.
   6. Mean Absolute Percentage Error (MAPE): The average of the absolute percentage differences between the predicted and actual values.
   7. Median Absolute Error: The median of the absolute differences between the predicted and actual values.
3. Clustering Models' Metrics
   1. Silhouette Score: Measures how similar an object is to its own cluster compared to other clusters.
   2. Davies-Bouldin Index: Measures the average similarity ratio of each cluster with the one most similar to it.
   3. Inertia (Within-Cluster Sum of Squares): Measures the compactness of the clusters, with lower values indicating better-defined clusters.
   4. Calinski-Harabasz Index (Variance Ratio Criterion): Measures the ratio of the sum of between-cluster dispersion to within-cluster dispersion.
4. Time Series Models' Metrics
   1. Mean Absolute Error (MAE): The average of the absolute differences between the predicted and actual values.
   2. Mean Squared Error (MSE): The average of the squared differences between the predicted and actual values.
   3. Root Mean Squared Error (RMSE): The square root of the average of the squared differences between the predicted and actual values.
   4. Mean Absolute Percentage Error (MAPE): The average of the absolute percentage differences between the predicted and actual values.
   5. Mean Squared Logarithmic Error (MSLE): The average of the squared logarithmic differences between the predicted and actual values.
5. Model Evaluation Techniques
   1. Cross-Validation: A technique to assess how the results of a statistical analysis will generalize to an independent data set, commonly k-fold cross-validation.
   2. Train-Test Split: Splitting the dataset into a training set and a test set to evaluate the model's performance.
   3. Bootstrapping: A resampling technique used to estimate statistics on a population by sampling a dataset with replacement.
   4. Learning Curve: A plot of model learning performance over time or over different training set sizes.
   5. Validation Curve: A plot of training and validation scores with respect to the model hyperparameters.
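
To make the regression metric definitions concrete, here is a small NumPy sketch that computes MAE, MSE, RMSE, R-Squared, and MAPE directly from their definitions. The function name `regression_metrics` and the sample values are made up for illustration; in practice the equivalent functions in `sklearn.metrics` would likely be used.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute common regression metrics from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))
    mse = np.mean(errors ** 2)
    rmse = np.sqrt(mse)

    # R^2 = 1 - (residual sum of squares) / (total sum of squares).
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot

    # MAPE in percent; undefined if any true value is zero.
    mape = np.mean(np.abs(errors / y_true)) * 100.0

    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2, "mape": mape}


metrics = regression_metrics([100, 200, 300, 400], [110, 190, 310, 390])
print(metrics)
```

Because MSE and RMSE square the errors, they penalize large misses more heavily than MAE, which matters if the price data contains outliers.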

## 4.1. Choose Evaluation Methods and Implement Them

## 4.2. Reasons for Choosing Those Metrics and Techniques
1. Why we chose these.

## 4.3. Discussion
1. What do the results of each metric and technique mean?
2. What is the follow-up?

# 5. Hyperparameter Tuning
Objective: Optimize the hyperparameters of your model to improve its performance.

## 5.1. Identify Hyperparameters
Objective: Focus on the key hyperparameters that significantly impact model performance. 

## 5.2. Choose a Tuning Method
Objective: Ensure that the method chosen aligns with the resources available (e.g., time, computational power).

Select an appropriate method for hyperparameter tuning. Common methods include:
1. Grid Search: Exhaustively search through a specified subset of hyperparameters. (This is in our course material.)
2. Random Search: Randomly sample from a specified subset of hyperparameters.
3. Bayesian Optimization: Use probabilistic models to select the most promising hyperparameters based on past evaluations.
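
A grid search can be sketched with scikit-learn's `GridSearchCV`, as below. The model (`Ridge`), the single hyperparameter `alpha`, the candidate values, and the synthetic data are all assumptions chosen to keep the example small; the same pattern would apply to whichever model and hyperparameters are selected in 5.1.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic regression data with five features, two of which are irrelevant.
X = rng.normal(size=(150, 5))
true_weights = np.array([2.0, -1.0, 0.0, 0.5, 0.0])
y = X @ true_weights + rng.normal(0, 0.3, size=150)

# Every value in the grid is evaluated with 5-fold cross-validation.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(
    Ridge(),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

print("best params:", search.best_params_)
# The score is negated MSE, so flip the sign to report the MSE itself.
print("best cross-validated MSE:", -search.best_score_)
```

The cost grows with the product of the number of values per hyperparameter, which is why Random Search (`RandomizedSearchCV` in scikit-learn) is often preferred when the grid becomes large.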

## 5.3. Run the Tuning Process

Note: train_data needs to be split into train_data and val_data.
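
One way to make this split, assuming scikit-learn's `train_test_split` and an 80/20 ratio (the ratio is an assumption, not specified here). The DataFrame below is a hypothetical stand-in for the real `train_data`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the train_data left after setting aside test_data.
train_data = pd.DataFrame({"feature": range(100), "price": range(100, 200)})

# Hold out 20% of the training rows for validation.
# random_state makes the split repeatable between runs.
train_data, val_data = train_test_split(train_data, test_size=0.2, random_state=42)

print(len(train_data), len(val_data))
```

If tuning is done with k-fold cross-validation instead, the folds take over the role of `val_data` and a separate validation split is not strictly required.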

## 5.4. Analyze Results

## 5.5. Validate the Optimized Model with test_data

# 6. Save and Document the Final Model