## A.1) Data cleaning

Mention the data cleaning steps taken to prepare your data for developing the model. This may include imputing missing values, dealing with outliers, combining levels of categorical variable(s), etc.

## A.2) Exploratory data analysis

Mention any major insights you obtained from the data, which you used to develop the model. Please put your code or visualizations here if needed.

- Insight 1: Correlation analysis. Many of the variables are correlated with one another, which led me to consider performing PCA to capture the most significant variance in the data.
- Insight 2: Missing values analysis. The data contain many missing values, but they are a small share of the total given the large number of observations.
- Insight 3: Data distribution. The distributions of the variables suggest that a log transformation might be helpful.
- Insight 4: Feature importance. Some variables may have a stronger impact on the target variable than others. However, on this data, random forest (RF) feature importance did not give any significant result.
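
The first three checks above can be sketched in base R as follows. This is a minimal illustration, not the code used in the competition: the data frame `df` and its columns `x1`, `x2`, `x3` are synthetic stand-ins, and `log1p()` is one possible choice of log transformation. The random forest importance from Insight 4 is left out because it needs an additional package.

```r
# Synthetic stand-in for the competition data (hypothetical, not the real dataset).
set.seed(1)
n  <- 200
x1 <- rlnorm(n)                    # right-skewed predictor
x2 <- x1 + rnorm(n, sd = 0.1)      # strongly correlated with x1
x3 <- rnorm(n)
x3[sample(n, 10)] <- NA            # introduce a few missing values
df <- data.frame(x1, x2, x3)

# Insight 1: pairwise correlations, using all complete pairs of observations.
cor_mat <- cor(df, use = "pairwise.complete.obs")

# Insight 2: share of missing values in each column.
missing_share <- colMeans(is.na(df))

# Insight 3: log-transform a skewed variable.
# log1p(x) computes log(1 + x), which stays finite when x is zero.
x1_log <- log1p(df$x1)

print(round(cor_mat, 2))
print(missing_share)
```

Comparing `hist(df$x1)` with `hist(x1_log)` is a quick way to judge whether the transformation makes a distribution more symmetric.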


## A.3) Feature selection/reduction

Mention the steps for feature selection/reduction. Please put your code or visualizations here if needed.

1. Perform PCA: Apply PCA to the dataset to reduce the dimensionality and capture the underlying patterns in the data. In the code, the `prcomp()` function is used to perform PCA on the `train_2` dataset with scaling enabled (`scale. = TRUE`).

2. Summary of PCA: Obtain the summary of the PCA results to understand the variance explained by each principal component. This can be done using the `summary()` function on the PCA object (`pca`).

3. Variance explained: Calculate the proportion of variance explained by each principal component. In the given code, the square of the standard deviation of each principal component is divided by the sum of squared standard deviations to get the variance explained.

4. Scree plot: Create a scree plot to visualize the variance explained by each principal component. The code provided uses the `plot()` function to plot the variance explained against the principal component number.

5. Select a subset of principal components: Choose a subset of the principal components based on the desired number of components (`num_components`). In the code, the first `num_components` principal components are selected using indexing (`pca$x[, 1:num_components]`), and they are assigned to the `selected_components` variable.

These steps help in feature selection/reduction by identifying the principal components that capture the most significant variance in the data. The scree plot and variance explained information assist in determining the optimal number of principal components to retain.
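
The five steps above can be sketched as follows. This is an illustration under stated assumptions rather than the original code: `train_2` is replaced here by a small synthetic data frame, and the 90% cumulative-variance cutoff used to set `num_components` is a hypothetical choice, since the report does not say how the number was picked.

```r
set.seed(42)
# Hypothetical stand-in for train_2: three correlated columns plus two noise columns.
latent  <- rnorm(100)
train_2 <- data.frame(
  a = latent + rnorm(100, sd = 0.2),
  b = latent + rnorm(100, sd = 0.2),
  c = latent + rnorm(100, sd = 0.2),
  d = rnorm(100),
  e = rnorm(100)
)

# Step 1: PCA on centered and scaled variables.
pca <- prcomp(train_2, center = TRUE, scale. = TRUE)

# Step 2: summary with standard deviation and proportion of variance per component.
pca_summary <- summary(pca)

# Step 3: proportion of variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var       <- cumsum(var_explained)

# Step 4: scree plot.
plot(var_explained, type = "b",
     xlab = "Principal component",
     ylab = "Proportion of variance explained")

# Step 5: keep the smallest number of components reaching the assumed 90% cutoff.
num_components      <- which(cum_var >= 0.90)[1]
selected_components <- pca$x[, 1:num_components, drop = FALSE]
```

The `drop = FALSE` argument keeps `selected_components` a matrix even when only one component is selected, so downstream model code does not have to handle a bare vector.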

## A.4) Developing the model

Mention the logical sequence of steps taken to obtain the final model. 

First, define the models. In this step, you define and train each individual model. Here, Model 1 represents `model_glm`, Model 2 represents `model_xgb`, and so on. The models can be different algorithms or variations of the same algorithm with different hyperparameters.

Second, make predictions on the test set. Using the trained models, you make predictions on the test set. For each model, you use the corresponding test data (e.g., `test_pca`) to generate predictions (`pred_glm`, `pred_xgb`, `pred_svm`).

Third, apply ensemble predictions. Combine the predictions from the individual models to create an ensemble prediction. In this case, the ensemble prediction is calculated as the average of the predictions from all the models: `(pred_glm + pred_xgb + pred_svm) / 3`.

Last, output the predictions. Save the ensemble predictions to a CSV file. In the code, the predictions are stored in a data frame with a single column (`y`), and then the data frame is written to a CSV file named "ensemble_predictions.csv" using the `write_csv()` function.
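
The averaging and output steps can be sketched as below. The three prediction vectors are made-up numbers standing in for the real `pred_glm`, `pred_xgb`, and `pred_svm`, and base R's `write.csv()` is used in place of `write_csv()` (from the readr package) so that the sketch runs without extra packages. The file is written to a temporary directory here, which is also an assumption.

```r
# Hypothetical predictions standing in for the output of the three trained models.
pred_glm <- c(2.0, 3.5, 1.0, 4.2)
pred_xgb <- c(2.4, 3.1, 1.3, 4.0)
pred_svm <- c(1.9, 3.3, 0.7, 4.4)

# Equal-weight ensemble: average the predictions row by row.
# Equivalent to (pred_glm + pred_xgb + pred_svm) / 3.
preds <- cbind(pred_glm, pred_xgb, pred_svm)
stopifnot(!anyNA(preds))   # a missing prediction would silently propagate otherwise
pred_ensemble <- rowMeans(preds)

# Store the predictions in a single-column data frame and write it to CSV.
submission <- data.frame(y = pred_ensemble)
out_file   <- file.path(tempdir(), "ensemble_predictions.csv")
write.csv(submission, out_file, row.names = FALSE)
```

Using `rowMeans()` on a matrix of predictions makes it easy to add or drop a model without rewriting the averaging formula.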

## A.5) Discussion

Please provide details of the models/approaches you attempted but encountered challenges or unfavorable outcomes. If feasible, kindly explain the reasons behind their ineffectiveness or lack of success. Additionally, highlight the significant challenges or issues you encountered during the process.

Initially, I attempted to use a Poisson regression model, but it did not yield favorable results. Upon reviewing the code, I suspected that the number of principal components retained after PCA was the issue. Instead of keeping only the few components that captured the most variance, I increased the number of retained components to explain a larger portion of the variance in the data. This slightly reduced the RMSE, but the approach still had limitations.
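
The kind of experiment described above, fitting a Poisson regression on an increasing number of principal components and comparing RMSE, could look roughly like the sketch below. Everything here is illustrative: the data are simulated, the train/validation split and the helper `rmse()` are assumptions, and the numbers it produces say nothing about the competition results.

```r
# Root mean squared error between observed and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

set.seed(7)
n <- 300
# Simulated predictors and a count response (hypothetical data).
X <- matrix(rnorm(n * 6), ncol = 6, dimnames = list(NULL, paste0("x", 1:6)))
y <- rpois(n, lambda = exp(0.5 + 0.3 * X[, 1] - 0.2 * X[, 4]))

# Fit PCA on the training rows only, then project the validation rows.
train_idx    <- 1:200
pca          <- prcomp(X[train_idx, ], center = TRUE, scale. = TRUE)
scores_train <- as.data.frame(pca$x)
scores_valid <- as.data.frame(predict(pca, newdata = X[-train_idx, ]))

# Validation RMSE of a Poisson GLM using the first k components, for k = 1..6.
rmse_by_k <- sapply(1:6, function(k) {
  train_k <- data.frame(y = y[train_idx], scores_train[, 1:k, drop = FALSE])
  fit     <- glm(y ~ ., data = train_k, family = poisson())
  pred    <- predict(fit, newdata = scores_valid[, 1:k, drop = FALSE],
                     type = "response")
  rmse(y[-train_idx], pred)
})
print(rmse_by_k)
```

Evaluating on held-out rows matters here, because training RMSE tends to keep falling as more components are added even when the extra components do not help on new data.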

To address these limitations, I explored alternative models such as KNN (K-Nearest Neighbors) and RF (Random Forest). These models offer different regression techniques and could potentially provide better results. By trying different algorithms, my goal was to improve the accuracy and overall performance of the regression model. This led me to consider an ensemble, which offers the flexibility to combine different types of models, such as decision trees, neural networks, or support vector machines. By leveraging the strengths of each model type, an ensemble can benefit from their complementary characteristics. In my case, the ensemble did achieve higher predictive accuracy than the individual models.

## A.6) Conclusion

* Do you feel that you gained valuable experience, skills, and/or knowledge? If yes, please explain what they were. If no, please explain.
* What are things you liked/disliked about the project and/or work on the project?

Yes, I gained a lot of valuable experience from participating in this Kaggle data competition. Throughout the project, I applied a wide range of data analysis techniques and enhanced my skills in various areas such as data preprocessing, feature engineering, model selection, and evaluation. I faced challenges at every stage of the competition, starting from data cleaning and preprocessing, to making informed decisions on feature selection, and finally developing models.

One of the key aspects of my learning experience was the iterative nature of the competition. I constantly reflected on the performance of my models, analyzing the results and seeking opportunities for improvement. This process allowed me to think critically about the shortcomings of my initial approaches and encouraged me to explore alternative solutions. I tried different algorithms, adjusted hyperparameters, and experimented with various feature engineering techniques to enhance the predictive power of my models. By actively engaging in this iterative process, I not only refined my technical skills but also developed a deeper understanding of the strengths and limitations of different modeling approaches. It taught me the importance of critically evaluating model performance and continually refining my strategies to achieve better results.

One of the aspects I particularly enjoyed about this project was the iterative process of building on previous progress and continuously striving to improve the regression model to achieve the best possible score. Throughout the competition, there were instances where the results were unfavorable or not as expected, and it could be disheartening. However, I found the process of reflection and analysis after each step to be incredibly valuable.

Taking the time to reflect on the model's performance, understanding the reasons behind any setbacks or suboptimal results, and identifying areas for improvement allowed me to fine-tune my approach. It was through this continuous learning and adjustment process that I made progress and saw tangible improvements in the regression model. While encountering setbacks can be discouraging, the process of pushing through and persevering in the face of challenges was both rewarding and fulfilling. It taught me the importance of resilience and the value of persistence in achieving better outcomes. This project helped me develop a growth mindset, where I saw setbacks as opportunities for learning and improvement, rather than obstacles to success.
