# Project 6 Capstone - Part 4: Findings and Technical Report

---

# Executive Summary
This analysis used machine learning models to identify business strategies for a new food service venture in Alabama, focusing on location selection and menu development.

### Location Analysis
Two clustering models, K-means and DBSCAN, were used to identify suitable locations for establishing the business:
* K-means (Silhouette Score: 0.443, Inertia: 15,436,534,561.065)
* DBSCAN (Silhouette Score: 0.566, Inertia: 2,544,347,384,021.328)
* Both models consistently highlighted Madison and Mobile counties as prime areas for initial investment. These counties balance population density and income levels, avoiding oversaturated markets while still offering a sufficient customer base, so both should be tested.

### Menu and Marketing Strategy
Two classification models, Logistic Regression and Random Forest, were used to determine optimal menu items and advertising focus:
* Logistic Regression (Accuracy: 0.9045, Precision: 0.9094, Recall: 0.9045, F1 Score: 0.9038)
* Random Forest (Accuracy: 0.8886, Precision: 0.8964, Recall: 0.8886, F1 Score: 0.8874)
* Key findings: focus on chicken, potatoes, and beans; avoid sushi, caviar, and crab
* These results align with the goal of targeting cost-effective, popular items among lower-income groups while avoiding high-cost ingredients.

### Recommendations
- Establish and test initial operations in Madison and Mobile counties
- Develop a menu centered on low-cost staples such as chicken, potatoes, and beans
- Concentrate paid search advertising on keywords related to these core ingredients
- Use high-cost items such as sushi, caviar, and crab as negative keywords in advertising campaigns

This data-driven approach positions the business to target the right location and customer base while aligning menu offerings and marketing strategy with them.

---

# Problem Statement:
As a data scientist at a ghost kitchen startup in Alabama, I have been tasked with market research that analyzes data on food value, food keywords, nutrition habits, physical activity levels, obesity rates, and other factors to identify distinct market segments across the state and determine the top menu items to deliver. The goal is to develop models that enable targeted marketing campaigns and tailored menu offerings, and to present the top locations along with recommended menu items and advertising keywords.

# Goals 
- K-means and DBSCAN: Determine the best location in Alabama to begin advertising the business and selling its products
  * Avoid the top-income areas: their higher-cost preferences will not be met
  * Avoid the lowest-income areas: residents have too little to spend on our products
  * Avoid the top-population areas: leave room to grow and expand, so that an early failure does not leave only smaller areas to consider
  * Avoid low-population areas: there must be enough people to advertise and sell to for a reasonable profit
- Logistic Regression and Random Forest: Determine the best menu items to buy at scale and the best foods to advertise on in paid search
  * Focus on low-cost foods to maximize net revenue on high-volume purchases
  * Focus on meals and ingredients that are popular with lower-income groups so that search-term volume is sufficient
  * Avoid high-cost foods: they will not be on the menu, and their search terms should be used as negative keywords
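
The location screening rules above can be pictured as a simple filter. The following is a minimal sketch, not the method used in this analysis: the tract records, the field names (`tract`, `income`, `population`), and the 10% trimming fraction are all hypothetical and chosen only for illustration.

```python
def cutoffs(values, fraction=0.10):
    """Return (low, high) bounds that trim the extremes at each end."""
    ordered = sorted(values)
    # Number of extreme values trimmed at each end (at least one).
    k = max(1, int(len(ordered) * fraction))
    return ordered[k - 1], ordered[-k]


def screen_tracts(tracts, fraction=0.10):
    """Keep tracts strictly inside the income and population bounds."""
    inc_lo, inc_hi = cutoffs([t["income"] for t in tracts], fraction)
    pop_lo, pop_hi = cutoffs([t["population"] for t in tracts], fraction)
    return [
        t for t in tracts
        if inc_lo < t["income"] < inc_hi and pop_lo < t["population"] < pop_hi
    ]


# Hypothetical census tracts (illustrative values only).
tracts = [
    {"tract": "A", "income": 20000, "population": 1500},   # lowest income
    {"tract": "B", "income": 45000, "population": 4000},
    {"tract": "C", "income": 52000, "population": 5200},
    {"tract": "D", "income": 120000, "population": 4800},  # highest income
    {"tract": "E", "income": 48000, "population": 12000},  # highest population
    {"tract": "F", "income": 50000, "population": 600},    # lowest population
]

kept = [t["tract"] for t in screen_tracts(tracts)]
print(kept)
```

In this toy data only the mid-range tracts survive the screen; the clustering models then operate on whatever features are retained.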

Variables of Interest
- Population
- Income
- Popular Foods

Modeling Choices
- K-means
- DBSCAN
- Logistic Regression
- Random Forest

# Model Performance and Metrics for Success:

### Supervised Classification Models for Menu and Marketing Strategy

Logistic Regression is a strong choice for text and keyword classification due to its:
* Simplicity and interpretability
* Efficiency with large datasets
* Good performance with linearly separable classes
* Ability to provide probability estimates

Performance:
* Accuracy: 0.9045 (90.45% correct predictions)
* Precision: 0.9094 (90.94% of positive predictions were correct)
* Recall: 0.9045 (90.45% of actual positive instances were identified)
* F1 Score: 0.9038 (harmonic mean of precision and recall)

Random Forest is also an excellent option for text and keyword classification because it:
* Handles non-linear relationships well
* Reduces overfitting through ensemble learning
* Provides feature importance rankings
* Performs well with high-dimensional data

Performance:
* Accuracy: 0.8886 (88.86% correct predictions)
* Precision: 0.8964 (89.64% of positive predictions were correct)
* Recall: 0.8886 (88.86% of actual positive instances were identified)
* F1 Score: 0.8874 (harmonic mean of precision and recall)
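
Recall equals accuracy for both models (0.9045 and 0.8886). That pattern is what support-weighted averaging across classes produces, because weighted recall is mathematically identical to accuracy, so the reported precision, recall, and F1 are probably weighted averages. This is an inference, not something the report states. A minimal pure-Python sketch of that calculation follows; the class labels (`focus`, `avoid`, `neutral`) and predictions are hypothetical.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)  # number of true instances per class
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for label in sorted(support):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / support[label]
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = support[label] / n  # class share of the true labels
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels and predictions (illustrative only).
y_true = ["focus"] * 4 + ["avoid"] * 3 + ["neutral"] * 3
y_pred = ["focus", "focus", "focus", "neutral",
          "avoid", "avoid", "focus",
          "neutral", "neutral", "avoid"]

metrics = weighted_metrics(y_true, y_pred)
print(metrics)
```

Under weighted averaging, F1 is the weighted mean of the per-class F1 scores rather than the harmonic mean of the averaged precision and recall, which is why it can fall below both, as it does in the figures above.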

Comparison:
Both models perform well, with Logistic Regression slightly outperforming Random Forest on every metric:
* Accuracy: Logistic Regression is 1.59 percentage points higher
* Precision: Logistic Regression is 1.30 percentage points higher
* Recall: Logistic Regression is 1.59 percentage points higher
* F1 Score: Logistic Regression is 1.64 percentage points higher

Logistic Regression performs better on this data and should be used going forward.
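
The percentage-point gaps can be reproduced from the reported metrics. This short sketch uses only the figures listed in this report; the dictionary and function names are illustrative.

```python
# Metrics as reported in this section.
logistic_regression = {"accuracy": 0.9045, "precision": 0.9094,
                       "recall": 0.9045, "f1": 0.9038}
random_forest = {"accuracy": 0.8886, "precision": 0.8964,
                 "recall": 0.8886, "f1": 0.8874}


def percentage_point_gaps(a, b):
    """Difference a - b for each metric, expressed in percentage points."""
    return {name: round((a[name] - b[name]) * 100, 2) for name in a}


gaps = percentage_point_gaps(logistic_regression, random_forest)
for name, gap in gaps.items():
    print(f"{name}: Logistic Regression is {gap:.2f} percentage points higher")
```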

### Unsupervised Clustering Models for Location Analysis

Both K-means and DBSCAN are popular clustering algorithms, each with its own strengths.

K-means

Performance:
* Silhouette Score: 0.443
* Inertia: 15,436,534,561.065

How it works:
* Randomly initializes k cluster centroids
* Assigns each data point to the nearest centroid
* Recalculates centroids based on assigned points
* Repeats the assignment and recalculation steps until the centroids stop moving (convergence)
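
The steps above can be sketched in plain Python. This is a minimal illustration, not the implementation used in the analysis; the function names, the fixed random seed, and the sample points are assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point (ties go to the first)."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(point, centroids[c]))


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Randomly pick k of the points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Recalculate each centroid as the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        # Stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical 2-D points (illustrative only).
points = [(1, 1), (1, 2), (2, 1), (2, 2),
          (10, 10), (10, 11), (11, 10), (11, 11)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

Because the starting centroids are random, K-means can settle on a poor local optimum; running it with several seeds and keeping the lowest-inertia result is a common safeguard.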

Interpretation:
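* A silhouette score of 0.443 indicates moderate cluster separation and cohesion (the score ranges from -1 to 1)
* Inertia is the sum of squared distances from each point to its assigned centroid; lower values indicate tighter clusters, but the raw value depends on feature scale and is meaningful only relative to other K-means runs on the same data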
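
Both metrics can be computed directly from the points and their cluster labels. The sketch below is a plain-Python illustration under the standard definitions (inertia as the sum of squared point-to-centroid distances, silhouette as the mean of `(b - a) / max(a, b)` per point); the function names and sample data are hypothetical, and the silhouette function assumes at least two clusters.

```python
import math


def inertia(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[label]) ** 2
               for p, label in zip(points, labels))


def silhouette_score(points, labels):
    """Mean silhouette coefficient; requires at least two clusters."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other points in the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the points of the nearest other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_label, other in clusters.items()
                if other_label != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical points, labels, and centroids (illustrative only).
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
labels = [0, 0, 1, 1]
centroids = [(0, 0.5), (10, 0.5)]

total_inertia = inertia(points, labels, centroids)
score = silhouette_score(points, labels)
print(total_inertia, round(score, 4))
```

Inertia is expressed in squared feature units, so features measured on large scales produce very large values even when the clusters are reasonable.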

DBSCAN

Performance:
* Silhouette Score: 0.566
* Inertia: 2,544,347,384,021.328

How it works:
* Identifies core points with a minimum number of neighbors within a specified radius
* Expands clusters from core points
* Labels points not reaching the density threshold as noise
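
The procedure above can be sketched in plain Python. This is a minimal illustration rather than the implementation used in the analysis; the parameter names `eps` and `min_samples` follow scikit-learn's naming, a point counts as its own neighbor (also as in scikit-learn), and the sample points are hypothetical.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def dbscan(points, eps, min_samples):
    n = len(points)
    # Neighborhood of each point: all points within eps, including itself.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None means "not yet visited"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        # Point i is a core point: start a new cluster and expand it.
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is also core: keep expanding
    return labels


# Hypothetical 2-D points: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_samples=3)
print(labels)
```

The number of clusters is never specified; it falls out of `eps` and `min_samples`, and the isolated point is labeled as noise instead of being forced into a cluster.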

Interpretation:
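* The higher silhouette score suggests better-defined clusters than K-means, although the two scores are not strictly comparable
* Inertia is a centroid-based measure tied to K-means, so the higher value here is not directly comparable across the two clustering approaches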

### Comparison

Cluster Shape:
* K-means: Assumes spherical clusters
* DBSCAN: Can identify clusters of arbitrary shapes

Number of Clusters:
* K-means: Requires pre-defining the number of clusters
* DBSCAN: Automatically determines the number of clusters

Outlier Handling:
* K-means: Sensitive to outliers
* DBSCAN: Can identify and exclude outliers as noise

Scalability:
* K-means: Generally more scalable to large datasets
* DBSCAN: Can be less efficient with very large datasets

Performance on Data:
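- DBSCAN shows the higher silhouette score (0.566 versus 0.443), but the two scores should not be compared directly, because the models partition the data differently: DBSCAN chooses its own number of clusters and may label some points as noise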

### Limitations and Future Improvements

* Feature Engineering: Experiment with different combinations of features or derived metrics
* Hyperparameter Tuning: Try different values of k for K-means; experiment with the eps and min_samples parameters for DBSCAN
* Dimensionality Reduction: Apply PCA before clustering to potentially improve results
* Ensemble Methods: Combine results from multiple models for more robust results
* Geospatial Considerations: Incorporate geographical coordinates and expand the search to locations outside Alabama
* Improve Data Acquisition: Obtain additional food preference information specific to the area where the food will be sold
  
By considering these factors and potential improvements, this venture can refine clustering approaches and gain more meaningful insights.

# Results:
- Top census tracts in Alabama for initial investment
  * Madison and Mobile are the most frequently occurring counties among the top census tracts from both models
- Top meals and ingredients to focus on in keyword advertising and menu development
  * Chicken, potatoes, beans, etc.
- Top meals and ingredients to avoid in keyword advertising and menu development
  * Sushi, caviar, crab, etc.