Updated rmd file with link to eda report, revised eda section
acheri5 committed Dec 8, 2021
1 parent 034ce42 commit 6a171a3
16 changes: 10 additions & 6 deletions reports/coffee_rating_prediction_report.Rmd
@@ -41,19 +41,21 @@ Further cleaning was performed on this dataset to remove all variables except `t

#### Exploratory Data Analysis

We performed exploratory data analysis on the relationship between the numeric features and the target variable, `total_cup_points`, using the following visualizations: a histogram showing the distribution of `total_cup_points`, and a heatmap of the correlation between the numeric features and `total_cup_points`.

```{r target-histogram, echo=FALSE, fig.cap="Figure 1. Distribution of target variable, total_cup_points", out.width = '100%'}
knitr::include_graphics("../results/images/target_histogram.png")
```

In the above plot, we observed that `total_cup_points` (the target variable) has a left-skewed distribution, ranging from 60 to 90, with most values between 80 and 85.

```{r correlation-heatmap, echo=FALSE, fig.cap="Figure 2. Correlation heatmap of numeric features against target", out.width = '100%'}
knitr::include_graphics("../results/images/correlation_matrix_heatmap.png")
```

In the above correlation matrix, we observed that `moisture` has the highest absolute correlation with `total_cup_points`, followed by `quakers`. All features except `quakers` have a negative correlation with the target variable. `altitude_mean_meters` shows only a weak correlation with the target, although this feature later proves useful in classification modelling. We will revisit feature importances during the modelling process to see whether they match these EDA results.
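The per-feature correlations behind the heatmap can be sketched as follows. This is a minimal illustration on synthetic data; the column names (`moisture`, `quakers`, `altitude_mean_meters`) follow the dataset schema, but the values and coefficients here are made up for demonstration.

```r
# Correlation of each numeric feature with the target, ranked by
# absolute value -- synthetic data, for illustration only.
set.seed(123)
df <- data.frame(
  moisture             = runif(50),
  quakers              = rpois(50, 2),
  altitude_mean_meters = runif(50, 800, 2000)
)
# Fake target: mostly driven by moisture, plus noise.
df$total_cup_points <- 85 - 3 * df$moisture + 0.5 * df$quakers + rnorm(50, sd = 0.5)

feature_cols <- setdiff(names(df), "total_cup_points")
cors <- sapply(df[feature_cols], function(x) cor(x, df$total_cup_points))
sort(abs(cors), decreasing = TRUE)  # features ranked by absolute correlation
```

A heatmap like Figure 2 is then just a visual rendering of this same correlation vector (or the full correlation matrix).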

A detailed exploratory data analysis report can be found [here](https://github.com/UBC-MDS/Coffee_quality_rating_predictor/blob/main/src/coffee_rating_eda.ipynb).

#### Machine Learning Models

@@ -74,7 +76,7 @@ kable(col,

As seen from the results of cross-validation, both Ridge (R\^2 score -0.065) and Random Forest Regression models (R\^2 score 0.169) did not perform very well on our dataset. The Random Forest Regression model had a higher validation score than the Ridge model, so we picked this model for hyperparameter optimization via the random search algorithm. This improved the R\^2 score to 0.25.
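The random search step can be sketched schematically as below. This is not the report's actual tuning code: `cv_score` is a stand-in for the cross-validated R^2 of a Random Forest fit, and the hyperparameter names and ranges are illustrative assumptions.

```r
# Schematic random search over hyperparameters.
# In the real analysis, cv_score would run cross-validation on a
# Random Forest Regressor; here it is a toy surrogate function.
set.seed(42)
cv_score <- function(mtry, min_node_size) {
  # Toy surrogate score, peaking at mtry = 4, min_node_size = 5.
  -((mtry - 4)^2 + (min_node_size - 5)^2) / 100
}

best <- list(mtry = NA, min_node_size = NA, score = -Inf)
for (i in 1:25) {                      # 25 random candidates
  mtry          <- sample(2:8, 1)
  min_node_size <- sample(1:10, 1)
  s <- cv_score(mtry, min_node_size)
  if (s > best$score) {
    best <- list(mtry = mtry, min_node_size = min_node_size, score = s)
  }
}
best  # best hyperparameter combination found
```

Random search simply samples candidate settings instead of exhaustively gridding them, which is why it scales well when only a few hyperparameters actually matter.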

```{r feature-imp-rfr, echo=FALSE, fig.cap="Figure 3. Random Forest Regressor Feature Importances", out.width = '70%'}
knitr::include_graphics("../results/images/feature_importance_rfr_plot.png")
```

@@ -92,10 +94,12 @@ Interestingly, the top 5 important features in this classifier included 3 simila

## Critique, Limitations and Future Improvements

We faced several limitations in our analysis, such as the small dataset size (approximately a thousand rows) and the limited types of features available for feature engineering and modelling. In addition, many features had to be discarded due to their lack of relevance to our models. For example, features such as aroma, flavour, aftertaste, acidity, body, balance, uniformity, and sweetness were all discarded, as they were individual contributors to the calculation of the target variable (`total_cup_points`).

Our model analysis may be improved with the inclusion of more relevant predictive features or more data. We could also try adding polynomial features and engineering new features by combining existing ones. More data cleaning could also be applied, such as inspecting and removing outliers, as these can significantly affect regression model performance. Additionally, we should try more classification models (Naive Bayes, Logistic Regression, etc.), given the advantage we observed of classification over regression models.
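One of the suggested cleaning steps, outlier removal, could be sketched with the common 1.5 x IQR rule. This is a generic heuristic offered as an example, not the method implemented in the report, and the scores below are invented.

```r
# Drop values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] -- a common
# outlier heuristic (illustrative, not the report's actual method).
remove_iqr_outliers <- function(x) {
  q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  x[x >= q[1] - 1.5 * iqr & x <= q[2] + 1.5 * iqr]
}

scores <- c(59.8, 80, 81, 82, 83, 84, 85, 90.6)  # made-up cup scores
remove_iqr_outliers(scores)  # extreme low and high values are dropped
```

Because ordinary least squares minimizes squared error, a handful of extreme ratings can pull the fit noticeably, which is why this kind of inspection matters more for regression than for tree-based models.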

This report was constructed using the rmarkdown [@Rrmarkdown] and knitr [@Rknitr] packages in R.

## References
