## Alice Tang ADAN 7431 HW 1

### For this assignment, we will be participating in Kaggle's 'What's Cooking' Competition: https://www.kaggle.com/competitions/whats-cooking/overview 

### Here, we are tasked with predicting a dish's cuisine category from its list of ingredients.

To start, let's import some needed libraries. 

In [1]:
# Import needed packages.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score

Great, now let's load our training data.

In [2]:
# Set the file path and load the training data. 
train_file_path = r"C:\Users\alice\Downloads\whats-cooking\train.json\train.json"
train_data = pd.read_json(train_file_path)

Next, we'll print the counts of each cuisine to check for class imbalances. 

In [3]:
print(train_data['cuisine'].value_counts())

cuisine
italian         7838
mexican         6438
southern_us     4320
indian          3003
chinese         2673
french          2646
cajun_creole    1546
thai            1539
japanese        1423
greek           1175
spanish          989
korean           830
vietnamese       825
moroccan         821
british          804
filipino         755
irish            667
jamaican         526
russian          489
brazilian        467
Name: count, dtype: int64


The number of examples varies widely across cuisines. Italian and Mexican have the largest representation (7,838 and 6,438 recipes), whereas Brazilian and Russian have the least (467 and 489). We should keep this imbalance in mind when interpreting the results.

### Let's do some text preprocessing! 
#### We'll start by converting the ingredients into a text format and then doing TF-IDF Vectorization. 
Our ML models only work with numerical data, so we need to turn each recipe into numbers. As a first step, we join each list of ingredients into a single text string, so that each recipe is one document. TF-IDF Vectorization then converts those strings into numerical features: each word gets a weight that reflects its importance in a document relative to the corpus. The weight accounts for both the frequency of the word in the specific recipe and its rarity across all of the recipes.

In [4]:
# Converting the ingredients into a text format.
train_data['ingredients_text'] = train_data['ingredients'].apply(lambda x: ' '.join(x))

# Completing TF-IDF Vectorization.
tfidf_vectorizer = TfidfVectorizer()
X_train = tfidf_vectorizer.fit_transform(train_data['ingredients_text'])
y_train = train_data['cuisine']


Now let's split our training data into training and validation sets.

In [5]:
# Splitting the training data into training and validation sets. 
X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
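Given the class imbalance noted earlier, one optional variation is a stratified split, which keeps the class proportions the same in both parts. The sketch below uses hypothetical toy labels; it is not what produced the results in this notebook, which use the plain split above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 "italian" rows and 20 "brazilian" rows.
toy_labels = pd.Series(["italian"] * 80 + ["brazilian"] * 20)
toy_features = pd.DataFrame({"row_id": range(len(toy_labels))})

# stratify=toy_labels asks for the same class proportions in both parts.
feat_train, feat_val, lab_train, lab_val = train_test_split(
    toy_features, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

print(lab_train.value_counts(normalize=True))
print(lab_val.value_counts(normalize=True))
```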


### As for the chosen model, out of curiosity since I have never used one before, we're going to go with an ensemble model.
#### Here, we have chosen 3 models often used in text classification problems: Multinomial NB, Logistic Regression, and SVC. Each model has its strengths and weaknesses, so by combining them in an ensemble model, the goal is to promote the strengths of each one to create a powerful model. 

Multinomial NB is a very simple, fast, and effective model for text classification tasks. However, it assumes independence between the features, which rarely holds for ingredients that tend to appear together.

On the other hand, Logistic Regression is a versatile and interpretable model, which can be used for binary and multiclass classification. It is able to capture linear relationships between the features and the target variable. Its weakness is that it struggles with complex non-linear relationships. The maximum number of iterations (max_iter) can be increased if the solver needs more steps to converge.

Lastly, SVCs remain a powerful choice for a model that can learn complex decision boundaries. They are effective in high-dimensional spaces and when dealing with non-linear relationships. However, SVCs can require more computational resources and longer training times.

By combining these models into an ensemble, we can leverage the strengths of each one. Each model contributes its own predictions, and the ensemble combines them to make a final decision. We will set voting='soft' so that the predicted class probabilities from each model are averaged. The overarching goal is better generalization and robustness than a single model would give.
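As a small worked example of soft voting, the sketch below averages made-up probabilities from three models for a single recipe. The numbers and class names are hypothetical and chosen only to illustrate the arithmetic.

```python
import numpy as np

classes = np.array(["italian", "mexican", "thai"])

# Hypothetical per-model class probabilities for one recipe.
proba_nb = np.array([0.50, 0.30, 0.20])
proba_lr = np.array([0.30, 0.60, 0.10])
proba_svc = np.array([0.35, 0.45, 0.20])

# Soft voting: average the probabilities, then pick the largest.
mean_proba = np.mean([proba_nb, proba_lr, proba_svc], axis=0)
soft_vote = classes[np.argmax(mean_proba)]

# Hard voting, for contrast: each model votes for its own top class.
top_classes = [classes[np.argmax(p)] for p in (proba_nb, proba_lr, proba_svc)]

print(mean_proba, soft_vote, top_classes)
```

Soft voting lets a confident model outweigh less confident ones, whereas hard voting counts each model's top choice equally.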

In [6]:
# Defining each model we plan to use. 
model1 = MultinomialNB()
model2 = LogisticRegression(max_iter=500) # Setting max iterations to 500, but this can be adjusted accordingly. 
model3 = SVC(probability=True)

Next, we fit the three individual models so that each one learns the patterns and relationships within the training data. Strictly speaking, the ensemble does not require this step, since VotingClassifier fits its own copies of the models, but it leaves us with fitted individual models that can be inspected on their own.

In [7]:
# Fitting the individual models.
model1.fit(X_train_split, y_train_split)
model2.fit(X_train_split, y_train_split)
model3.fit(X_train_split, y_train_split)

Here, we are using scikit-learn's 'VotingClassifier' ensemble method. It combines the individual models so that they make predictions together.

In [8]:
# Creating an ensemble of models.
ensemble_model = VotingClassifier(estimators=[('nb', model1), ('lr', model2), ('svm', model3)], voting='soft') # Each string identifier references the corresponding model. 

# Voting is set to 'soft' so the predicted class probabilities from each model are averaged. The class with the highest average probability is chosen as the final prediction.

Now, we are ready to begin training the ensemble model.

In [9]:
# Training the ensemble model.
ensemble_model.fit(X_train_split, y_train_split)
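A side note on what this call does: `VotingClassifier.fit` trains clones of the estimators it was given, not the original objects. The sketch below illustrates this on synthetic data; it swaps in `GaussianNB` and only two models because the toy features are continuous, so it is an illustration rather than a copy of the setup above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Synthetic toy data (hypothetical, unrelated to the recipes).
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

lr = LogisticRegression(max_iter=500)
nb = GaussianNB()

toy_ensemble = VotingClassifier(estimators=[("lr", lr), ("nb", nb)], voting="soft")
toy_ensemble.fit(X_toy, y_toy)

# The fitted copies live in named_estimators_.
fitted_lr = toy_ensemble.named_estimators_["lr"]
print(fitted_lr is lr)        # whether the ensemble reuses the original object
print(hasattr(lr, "coef_"))   # whether the original object was fitted
```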

After training, we are able to make predictions on the validation set using the ensemble model.

In [10]:
# Making predictions on the validation set.
ensemble_predictions = ensemble_model.predict(X_val_split)

Great! Now we can evaluate the ensemble model's accuracy.

In [11]:
# Evaluating the ensemble model.
accuracy = accuracy_score(y_val_split, ensemble_predictions)
print(f'Ensemble Model Accuracy: {accuracy}')

Ensemble Model Accuracy: 0.7966059082338152
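To put an ensemble score in context, each base model can be scored on the same validation set. The sketch below shows the pattern on a tiny made-up dataset (all names and recipes are hypothetical); on the real data the same loop would run over the notebook's models and validation split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, already joined into strings.
train_text = [
    "soy sauce rice ginger",
    "rice sesame oil soy sauce",
    "olive oil tomato basil",
    "tomato parmesan pasta basil",
]
train_labels = ["chinese", "chinese", "italian", "italian"]
val_text = ["soy sauce ginger scallions", "pasta tomato olive oil"]
val_labels = ["chinese", "italian"]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_text)
X_va = vec.transform(val_text)

# Fit and score each candidate model on the same split.
candidates = {"nb": MultinomialNB(), "lr": LogisticRegression(max_iter=500)}
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, train_labels)
    scores[name] = accuracy_score(val_labels, model.predict(X_va))

print(scores)
```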


We obtained a validation accuracy of about 0.80. The aim of the ensemble was a stronger and more robust result than a single model would give, although the individual models are not scored separately here. Let's make our predictions on the test data so we can submit to Kaggle for evaluation!

In [12]:
# Loading the test data and setting the file path.
test_file_path = r"C:\Users\alice\Downloads\whats-cooking\test.json\test.json"
test_data = pd.read_json(test_file_path)

We'll need to follow the same preprocessing steps for the test data as we did earlier for the training data.

In [None]:
# Converting the test ingredients to a text format, as we did for the training data.
test_data['ingredients_text'] = test_data['ingredients'].apply(lambda x: ' '.join(x))

# Reusing the vectorizer already fitted on the training data to transform the test data.
# It is not refitted on the test set, so the features match those the models were trained on.
X_test = tfidf_vectorizer.transform(test_data['ingredients_text'])
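The reason for calling `transform` rather than `fit_transform` on the test data is that the feature columns must match the training vocabulary. A minimal sketch with made-up strings (hypothetical data) shows that words unseen during fitting are simply ignored:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_text = ["rice soy ginger", "tomato basil garlic"]
test_text = ["rice saffron"]  # "saffron" never appears in the training text

vec = TfidfVectorizer()
vec.fit(train_text)

# transform() reuses the training vocabulary: same number of columns,
# and no column is created for the unseen word.
X_new = vec.transform(test_text)

print(X_new.shape)
print("saffron" in vec.vocabulary_)
```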

We are now ready to make our predictions on the test data.

In [13]:
# Making predictions on our preprocessed test data.
predictions = ensemble_model.predict(X_test)

Lastly, we'll need to create a suitable dataframe for the submission following the sample submission guidelines.

In [14]:
# Creating a df for the submission.
submission_df = pd.DataFrame({'id': test_data['id'], 'cuisine': predictions})

# Saving the submission df to a CSV file, ready to submit on Kaggle.
submission_df.to_csv('C:/Users/alice/Downloads/submission_ensemble.csv', index=False)

Our submission scored an accuracy of 0.789 on Kaggle. This is a reasonably decent result for this competition. In the future, it would be good to address the class imbalance. If I attempt this again, I would try SMOTE to oversample the minority classes and see whether that improves my score.
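SMOTE lives in the separate imbalanced-learn package. As a lighter-weight alternative that needs only scikit-learn, class weighting can be tried first. The sketch below is not SMOTE: it compares an unweighted and a `class_weight='balanced'` Logistic Regression on synthetic imbalanced data (hypothetical, unrelated to the recipes) using macro-averaged F1, which gives minority classes equal weight.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly a 90/10 class imbalance.
X_imb, y_imb = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_a, X_b, y_a, y_b = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=0, stratify=y_imb
)

# Compare the same model with and without balanced class weights.
macro_f1 = {}
for label, weight in [("unweighted", None), ("balanced", "balanced")]:
    clf = LogisticRegression(max_iter=500, class_weight=weight)
    clf.fit(X_a, y_a)
    macro_f1[label] = f1_score(y_b, clf.predict(X_b), average="macro")

print(macro_f1)
```

Whether weighting or oversampling helps on the recipe data would need to be checked on the validation set, since Kaggle scores this competition on plain accuracy.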