# First Step: As always, make sure we can read the data!
I usually use the top block to install all the libraries and confirm I can actually read the data, since doing this later in the notebook gets messy and hard to read. Here I'm just making sure I can access the dataset (and, on the side, checking for any problems with the current Python environment/kernel).

In [2]:
# Install some useful packages for data analysis and visualization
!pip install pandas --break-system-packages
!pip install matplotlib --break-system-packages
!pip install seaborn --break-system-packages
!pip install scikit-learn --break-system-packages
!pip install tensorflow --break-system-packages
!pip install keras --break-system-packages


from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Load the dataset
data = pd.read_csv('FoodX.csv')

# Display the first few rows of the dataset
data.head()



Unnamed: 0,Year,Major,University,Time,Order
0,Year 2,Physics,Indiana State University,12,Fried Catfish Basket
1,Year 3,Chemistry,Ball State University,14,Sugar Cream Pie
2,Year 3,Chemistry,Butler University,12,Indiana Pork Chili
3,Year 2,Biology,Indiana State University,11,Fried Catfish Basket
4,Year 3,Business Administration,Butler University,12,Indiana Corn on the Cob (brushed with garlic b...
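
One thing the preview shows: `Unnamed: 0` is the name pandas gives a column whose header cell is empty, which usually means a row index was saved along with the data. A minimal sketch of reading such a file with `index_col=0` so the index does not end up as a feature later. The inline CSV below reuses the first three rows shown above, and the empty first header cell is an assumption about how `FoodX.csv` was written:

```python
import io
import pandas as pd

# Stand-in for FoodX.csv: the first three rows from the preview above.
# The empty first header cell is an assumption about the real file.
csv_text = """,Year,Major,University,Time,Order
0,Year 2,Physics,Indiana State University,12,Fried Catfish Basket
1,Year 3,Chemistry,Ball State University,14,Sugar Cream Pie
2,Year 3,Chemistry,Butler University,12,Indiana Pork Chili
"""

# index_col=0 tells pandas to treat the first column as the row index
demo = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(demo.columns.tolist())
```

If the real file behaves the same way, `pd.read_csv('FoodX.csv', index_col=0)` would keep the row counter out of `X`.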


# Next step: Choosing a model

Perfect, we can read the dataset! Now it's time to choose a model, which is always a little fun. I'm going to start with a Random Forest or Decision Tree, since this looks like a straightforward classification problem on categorical, tabular data, and tree-based models are among the easiest ways to handle that kind of dataset. I'm going to use scikit-learn since it's the quickest way to get these two models up and running :)

# Random Forest Implementation

In [4]:
# Data preprocessing
# Encode categorical variables using LabelEncoder
label_encoder = LabelEncoder()
data['Year'] = label_encoder.fit_transform(data['Year'])
data['Major'] = label_encoder.fit_transform(data['Major'])
data['University'] = label_encoder.fit_transform(data['University'])
data['Time'] = label_encoder.fit_transform(data['Time'])
data['Order'] = label_encoder.fit_transform(data['Order'])

# Split the data into features (X) and target (y)
X = data.drop('Time', axis=1)  # 'Time' is the target; every other column is used as a feature
y = data['Time']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_state=42)

# Create and train the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions on new data
y_pred = rf_model.predict(X_test)

# Evaluate the model (you can use different metrics depending on your task)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Accuracy: 0.504
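
To put that number in context, it helps to know what a model that ignores the features entirely would score. A minimal sketch using scikit-learn's `DummyClassifier`. The `X_demo` and `y_demo` arrays are made-up stand-ins (not values from `FoodX.csv`); in the notebook you would fit on `X_train, y_train` and score on `X_test, y_test` instead:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical stand-in data, NOT from FoodX.csv
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 5, size=(200, 3))              # ignored by the dummy model
y_demo = np.array([12] * 100 + [11] * 60 + [14] * 40)   # made-up order hours

# Always predict the most common class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

# Scored on the same demo data purely for illustration
baseline_accuracy = accuracy_score(y_demo, baseline.predict(X_demo))
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

If the baseline on the real data lands close to the Random Forest's score, the features are adding very little.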


# Let's do some tuning!

Well, that accuracy is a little underwhelming! Let's try tuning some of those parameters to see if we can get better results...

In [8]:
# Define a parameter grid for Grid Search
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None]
}

# Perform Grid Search with cross-validation
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, scoring='accuracy', cv=5)
grid_search.fit(X_train, y_train)

# Get the best parameters and estimator
best_params = grid_search.best_params_
best_estimator = grid_search.best_estimator_

# Evaluate the best model on the test data
accuracy = best_estimator.score(X_test, y_test)
print(f"Best Model Parameters: {best_params}")
print(f"Accuracy on test data with best model: {accuracy:.2f}")



Best Model Parameters: {'criterion': 'gini', 'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 10}
Accuracy on test data with best model: 0.50
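
The test score comes from a small hold-out split, so it can be worth also looking at the cross-validated scores the search itself used to pick a winner. A minimal sketch on synthetic data, reading `best_score_` and `cv_results_` from a fitted `GridSearchCV`. The `make_classification` dataset and the tiny grid are assumptions for illustration only:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the encoded features (assumption, not the real data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=42
)

# Deliberately tiny grid so the sketch runs quickly
small_grid = {"max_depth": [None, 5], "min_samples_split": [2, 10]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=42),
    param_grid=small_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(X_demo, y_demo)

# Mean cross-validated accuracy of the winning parameter combination
print(f"Best CV accuracy: {search.best_score_:.3f}")

# One row per parameter combination, best first
results = pd.DataFrame(search.cv_results_)[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
```

If every row of `results` has nearly the same `mean_test_score`, the hyperparameters are probably not the bottleneck.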


# Hmm, no improvement

Tuning didn't really move the needle: test accuracy went from 0.504 to 0.50. Before I go any deeper, though, I would like to try a single decision tree to see how it performs against the Random Forest. If a single decision tree can match or beat an ensemble of decision trees, then it may be worth it to just go with that, since training is much faster and inference is cheaper to scale, saving the business some costs.

Another reason for trying out decision trees is that anything that improves the accuracy of a single decision tree should also carry over to the Random Forest implementation, without nearly as much waiting around for training to finish.
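
For a rough feel of the training-time gap, here is a minimal timing sketch. It uses a synthetic dataset from `make_classification` (an assumption, not the FoodX data), and `fit_seconds` is a hypothetical helper, so the absolute numbers will differ on the real data and on other machines:

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption, not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=5, n_informative=3, random_state=42
)

def fit_seconds(model):
    """Return the wall-clock seconds taken by a single call to model.fit."""
    start = time.perf_counter()
    model.fit(X_demo, y_demo)
    return time.perf_counter() - start

tree_seconds = fit_seconds(DecisionTreeClassifier(random_state=42))
forest_seconds = fit_seconds(RandomForestClassifier(n_estimators=100, random_state=42))

print(f"Single tree: {tree_seconds:.4f}s")
print(f"Random forest (100 trees): {forest_seconds:.4f}s")
```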

# Decision Tree Implementation

Let's see how it fares against the Random Forest implementation.

In [9]:
# Create a Decision Tree model and fit it to the training data
decision_tree = DecisionTreeClassifier(random_state=42)
decision_tree.fit(X_train, y_train)

# Evaluate the model's performance on the test data
accuracy = decision_tree.score(X_test, y_test)
print(f"Accuracy on test data: {accuracy:.2f}")


Accuracy on test data: 0.52


# Celebrate!

Hey look at that, we got about the same accuracy as the Random Forest implementation with a *lot* less compute. Let's try and see how far we can go with a single decision tree.

In [10]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_state=42)

# Create a Decision Tree classifier
decision_tree = DecisionTreeClassifier(random_state=42)

# Define a parameter grid for Grid Search
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Perform Grid Search with cross-validation
grid_search = GridSearchCV(estimator=decision_tree, param_grid=param_grid, scoring='accuracy', cv=5)
grid_search.fit(X_train, y_train)

# Get the best parameters and estimator
best_params = grid_search.best_params_
best_estimator = grid_search.best_estimator_

# Evaluate the best model on the test data
accuracy = best_estimator.score(X_test, y_test)
print(f"Best Model Parameters: {best_params}")
print(f"Accuracy on test data with best model: {accuracy:.2f}")



Best Model Parameters: {'criterion': 'gini', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2}
Accuracy on test data with best model: 0.52


# So sad...

Looks like the best we're going to get out of decision trees and random forests is ~52% accuracy.

As a last hurrah, I'm going to try a very simple neural network using Keras and TensorFlow to see if it can do any better than the tree-based models.

In [18]:
# Data preprocessing
# Encode categorical variables using LabelEncoder
label_encoder = LabelEncoder()
data['Year'] = label_encoder.fit_transform(data['Year'])
data['Major'] = label_encoder.fit_transform(data['Major'])
data['University'] = label_encoder.fit_transform(data['University'])
data['Time'] = label_encoder.fit_transform(data['Time'])
data['Order'] = label_encoder.fit_transform(data['Order'])

# Split the data into features (X) and target (y)
X = data.drop('Time', axis=1)  # 'Time' is the target; every other column is used as a feature
y = data['Time']

# One-hot encode the target variable (time of the food orders)
y = to_categorical(y)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a simple feedforward neural network
model = Sequential()
model.add(Dense(128, input_dim=X_train.shape[1], activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(y_train.shape[1], activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit the model to the training data
model.fit(X_train, y_train, epochs=150, batch_size=16, verbose=1)

# Evaluate the model on the test data
accuracy = model.evaluate(X_test, y_test, verbose=0)[1]
print(f"Accuracy on test data with neural network: {accuracy:.2f}")


Epoch 1/150
Epoch 2/150
...
Epoch 78
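
One thing that may be worth trying with the network: `LabelEncoder` turns each major or university into an arbitrary integer, and a dense layer treats those integers as magnitudes. A minimal sketch of one-hot encoding the categorical inputs with `pd.get_dummies` instead. The small frame below reuses the first four rows from the `data.head()` preview and leaves out `Order` for brevity; whether this helps accuracy on the full dataset is untested:

```python
import pandas as pd

# First four rows from the data.head() preview (Order column left out for brevity)
demo = pd.DataFrame({
    "Year": ["Year 2", "Year 3", "Year 3", "Year 2"],
    "Major": ["Physics", "Chemistry", "Chemistry", "Biology"],
    "University": [
        "Indiana State University",
        "Ball State University",
        "Butler University",
        "Indiana State University",
    ],
    "Time": [12, 14, 12, 11],
})

# One 0/1 column per category value, instead of one integer code per column
X_onehot = pd.get_dummies(demo.drop(columns=["Time"]), dtype="float32")
print(X_onehot.columns.tolist())
print(X_onehot.shape)
```

The resulting frame could be passed to `train_test_split` and the `Sequential` model in place of the label-encoded `X`.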

# Conclusions

Of all the models I tested (Decision Tree, Random Forest, and a simple Neural Network), the Decision Tree ended up being the cheapest to train and run, and its accuracy was within the margin of error of the others, considering all of these models plateaued at around the same accuracy (~50%).
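
For a rough sense of that margin of error: accuracy measured on n test rows behaves like a binomial proportion, so its standard error is about sqrt(p(1 - p) / n). A minimal sketch, where `accuracy_interval` is a hypothetical helper and the test-set size `n_test = 250` is an assumption (the notebook never prints the real size):

```python
import math

def accuracy_interval(accuracy, n_test, z=1.96):
    """Approximate 95% interval (normal approximation) for an accuracy
    measured on n_test independent test rows."""
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n_test)
    return accuracy - z * standard_error, accuracy + z * standard_error

n_test = 250  # ASSUMED test-set size; replace with len(y_test)

for name, acc in [("Random Forest", 0.504), ("Decision Tree", 0.52)]:
    low, high = accuracy_interval(acc, n_test)
    print(f"{name}: {acc:.3f} (95% interval roughly {low:.3f} to {high:.3f})")
```

Under that assumed size the intervals overlap almost completely, which is consistent with treating 0.50 and 0.52 as the same result.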

I think the dataset could be improved by adding more columns. Right now only a few colleges and majors are really represented, with a fairly evenly distributed order-time window, so the data gives only a very general picture of who orders at what time, and order time isn't strongly correlated with the other user stats.
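
One way to check how much each column actually says about the order time is mutual information. A minimal sketch using scikit-learn's `mutual_info_classif` on synthetic integer-coded columns. The data is made up: one column is built to track the target and one is pure noise, so the contrast is visible:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n_rows = 2000

# Made-up target with 4 classes, standing in for the encoded 'Time' column
time_demo = rng.integers(0, 4, size=n_rows)

features_demo = pd.DataFrame({
    # Built from the target, so it carries information about it
    "informative": (time_demo + rng.integers(0, 2, size=n_rows)) % 4,
    # Drawn independently of the target
    "noise": rng.integers(0, 4, size=n_rows),
})

# discrete_features=True because every column is an integer category code
mi = mutual_info_classif(
    features_demo, time_demo, discrete_features=True, random_state=0
)
mi_scores = pd.Series(mi, index=features_demo.columns).sort_values(ascending=False)
print(mi_scores)
```

On the real data, scores near zero for every column would back up the point that the existing features carry little signal about order time.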

Finding what type of person orders at what time is also very hard, simply because so many people order at the same time and there are so few extremes. This may just not be a good thing to look for.