In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.read_csv("starbucks-menu-nutrition-drinks.csv")

Imports the libraries used in this notebook and reads the Starbucks drinks nutrition menu into a pandas DataFrame.
TfidfVectorizer is used to convert the drink names into vectors.
cosine_similarity is used to calculate the cosine similarity between those feature vectors.
The similarity scores are then used to recommend similar drinks.
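
As a minimal sketch of how these two pieces fit together, the toy example below vectorises three made-up drink names and compares the first one against all of them. The names and the variables toy_names, toy_vectorizer, toy_matrix and toy_scores are illustrative only and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up drink names, used only for illustration
toy_names = ["Iced Coffee", "Iced Coffee with Milk", "Green Tea Latte"]

# Turn each name into a TF-IDF vector (one row per name)
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_names)

# Compare the first name against every name; take the single result row
toy_scores = cosine_similarity(toy_matrix[0], toy_matrix)[0]

for name, score in zip(toy_names, toy_scores):
    print(f"{name}: {score:.2f}")
```

The first name scores highest against itself, the second shares two words with it and lands in between, and the third shares no words and scores 0.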

In [None]:
df.info()

Returns the following information:

The total number of rows and columns in the dataframe,
The name of each column and its data type,
The number of non-null values in each column,
The amount of memory used by the dataframe

In [None]:
df.columns

This prints a list of all the column names in the DataFrame.

In [None]:
df.head(5)

This prints the first 5 rows of the DataFrame.

In [None]:
df = df.replace('-', 0)

This will replace all the '-' values in the DataFrame with 0.
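
One thing worth checking after this step is the column dtype: replacing the placeholder does not by itself turn a text column into numbers. The sketch below uses a small made-up frame (not the real file) to show this. It also shows pd.to_numeric with errors="coerce" as one possible alternative, which is an assumption and not what this notebook does; it marks the placeholder as missing instead of 0.

```python
import pandas as pd

# Made-up values standing in for a column that was read from the CSV as text
toy = pd.DataFrame({"Calories": pd.Series(["110", "-", "45"], dtype=object)})

# Same replacement as above: the '-' placeholder becomes 0
replaced = toy.replace("-", 0)
print(replaced["Calories"].tolist())  # still a mix of strings and an int
print(replaced["Calories"].dtype)     # object, so a cast is still needed

# Alternative (an assumption, not the notebook's approach): coerce to NaN
coerced = pd.to_numeric(toy["Calories"], errors="coerce")
print(coerced.tolist())
```

With the coerced version the placeholder stays distinguishable from a real 0, at the cost of having to handle NaN later.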

In [None]:
df['Calories'] = df['Calories'].astype(float)
df['Fat (g)'] = df['Fat (g)'].astype(float)
df['Carb. (g)'] = df['Carb. (g)'].astype(float)
df['Fiber (g)'] = df['Fiber (g)'].astype(float)
df['Protein'] = df['Protein'].astype(float)
df['Sodium'] = df['Sodium'].astype(float)

This will convert the data type of each of the six nutrition columns to float.
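
The same conversion can be written more compactly by passing a list of columns to a single astype call. This is a sketch on a made-up two-column frame; cols_to_convert is a hypothetical name, and in this notebook it would hold the six columns converted above.

```python
import pandas as pd

# Made-up frame: text values mixed with the 0 left by the '-' replacement
toy = pd.DataFrame({
    "Calories": pd.Series(["110", 0, "45"], dtype=object),
    "Sodium": pd.Series([0, "10", "5"], dtype=object),
})

# Hypothetical list of columns to convert in one go
cols_to_convert = ["Calories", "Sodium"]
toy[cols_to_convert] = toy[cols_to_convert].astype(float)

print(toy.dtypes)
```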

In [None]:
# select only the numeric columns using select_dtypes()
numeric_cols = df.select_dtypes(include=['float64', 'int64'])

# fill missing values in numeric columns with the mean using fillna()
numeric_cols = numeric_cols.fillna(numeric_cols.mean())

# replace the original numeric columns with the filled columns in the original DataFrame
df[numeric_cols.columns] = numeric_cols

This will fill the missing values in the numeric columns with the mean and replace the original numeric columns with the filled columns.
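
To see what the fill does, here is a sketch on a made-up frame with one missing value per numeric column. The data and the names toy and numeric are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up frame with one text column and two numeric columns containing NaN
toy = pd.DataFrame({
    "Beverage": ["A", "B", "C"],
    "Calories": [100.0, np.nan, 200.0],
    "Sodium": [10.0, 20.0, np.nan],
})

# Select the numeric columns and fill each NaN with that column's mean
numeric = toy.select_dtypes(include=["float64", "int64"])
toy[numeric.columns] = numeric.fillna(numeric.mean())

print(toy)
```

The text column is left untouched, and each NaN is replaced by the mean of the remaining values in its own column (150.0 for Calories and 15.0 for Sodium in this toy data).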

In [None]:
df.columns

Returns the column names of the DataFrame.

In [None]:
# nutrition columns in which a 0 is treated as a missing value
nutrition_cols = ['Calories', 'Fat (g)', 'Carb. (g)', 'Fiber (g)', 'Protein', 'Sodium']

for col in nutrition_cols:
    # calculate the mean of non-zero values
    mean_value = df[df[col] != 0][col].mean()

    # replace 0 values with the mean using replace()
    df[col] = df[col].replace(0, mean_value)

This code replaces 0 values with the mean value of non-zero values in each respective column.
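
To make the effect concrete, here is a sketch on a single made-up column. Note that this approach treats every 0 as missing, both the zeros that came from the earlier '-' replacement and any values that were genuinely 0 in the source data.

```python
import pandas as pd

# Made-up column with two zeros standing in for missing values
toy = pd.DataFrame({"Fat (g)": [0.0, 2.0, 4.0, 0.0]})

# Mean of the non-zero entries only: (2.0 + 4.0) / 2
nonzero_mean = toy.loc[toy["Fat (g)"] != 0, "Fat (g)"].mean()

# Every 0 is replaced by that mean
toy["Fat (g)"] = toy["Fat (g)"].replace(0, nonzero_mean)

print(toy["Fat (g)"].tolist())
```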

In [None]:
df.shape

This returns a tuple where the first element is the number of rows and the second element is the number of columns in the DataFrame.

In [None]:
df.drop_duplicates(inplace = True)

This code will drop any duplicate rows in the DataFrame and modify the DataFrame in place.
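
A small sketch on made-up rows shows the effect on the shape and on the index. After dropping duplicates the index labels keep their original values, so there can be gaps; the recommendation function later in this notebook uses iloc, which is positional, so the gaps do not matter there.

```python
import pandas as pd

# Made-up frame in which the first two rows are identical
toy = pd.DataFrame({
    "Beverage": ["Iced Coffee", "Iced Coffee", "Green Tea"],
    "Calories": [5.0, 5.0, 0.0],
})

print(toy.shape)           # shape before dropping duplicates
toy.drop_duplicates(inplace=True)
print(toy.shape)           # one row fewer
print(toy.index.tolist())  # original labels are kept, so label 1 is missing
```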

In [None]:
df.shape

In [None]:
# Create feature vector for each drink
tfidf = TfidfVectorizer(stop_words='english')
drink_matrix = tfidf.fit_transform(df['Beverage'])

Uses the TfidfVectorizer class from the scikit-learn library to create a feature vector for each drink in the "Beverage" column.

The TfidfVectorizer converts the drink names into a matrix of numerical values, weighting each word by how often it appears in a given name and down-weighting words that appear in many names.

stop_words='english' removes common English words (such as "with" or "and") before vectorisation.

The result is a sparse matrix in which most of the values are 0, because each drink name contains only a small subset of the words that appear across all drink names.
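
The sketch below inspects such a matrix for three made-up drink names: its shape, the learned vocabulary, and the number of non-zero entries. The names and the toy_ variables are illustrative and not part of the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up drink names, used only for illustration
toy_names = ["Iced Coffee with Milk", "Green Tea Latte", "Iced Green Tea"]

toy_tfidf = TfidfVectorizer(stop_words="english")
toy_matrix = toy_tfidf.fit_transform(toy_names)

# One row per name, one column per distinct word kept in the vocabulary
print(toy_matrix.shape)

# The stop word "with" does not appear in the vocabulary
print(sorted(toy_tfidf.vocabulary_))

# Number of stored non-zero values out of rows * columns cells
print(toy_matrix.nnz)
```

Each name contributes only three non-zero values to its row, so even in this tiny example half of the cells are 0; with a real menu the vocabulary is larger and the matrix is sparser.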

In [None]:
def recommend_drinks(inputs, n=3):
    # Create feature vector for input drinks
    input_matrix = tfidf.transform(inputs)

    # Compute cosine similarity between each input drink and every drink in the dataset
    similarities = cosine_similarity(input_matrix, drink_matrix)

    # Average the scores over all input drinks so that every input contributes
    scores = similarities.mean(axis=0)

    # Rank all drinks from most to least similar
    ranked = scores.argsort()[::-1]

    # Walk down the ranking, skip the input drinks, and stop after n names
    recommended_drinks = []
    for idx in ranked:
        name = df.iloc[idx]['Beverage']
        if name not in inputs:
            recommended_drinks.append(name)
        if len(recommended_drinks) == n:
            break

    # Return recommended drink names
    return recommended_drinks

This function takes a list of input drinks and the number of recommended drinks to return (default value is 3). It uses the tfidf object and the drink_matrix created earlier to calculate the cosine similarity between each input drink and every drink in the dataset, then averages the scores across the inputs so that every input contributes to the ranking.

The averaged scores are sorted in descending order to rank all drinks from most to least similar.

The function then walks down the ranking, skips the input drinks so that the recommended drinks are different from the input drinks, and stops once it has collected n names. Finally, it returns the names of the recommended drinks as a list.
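
The ranking step can be illustrated in isolation. The sketch below uses made-up scores rather than real cosine similarities, and compares two orders of operation: truncating to the top n before removing the inputs, and removing the inputs before truncating. All names and values are hypothetical.

```python
import numpy as np

# Made-up drink names and similarity scores, for illustration only
names = ["Iced Coffee", "Iced Coffee with Milk", "Green Tea Latte", "Iced Green Tea"]
scores = np.array([1.0, 0.6, 0.0, 0.3])
inputs = ["Iced Coffee"]
n = 2

# Indices ordered from highest to lowest score
ranked = scores.argsort()[::-1]

# Truncate first, then filter: the input itself uses up one of the n slots
truncate_then_filter = [names[i] for i in ranked[:n] if names[i] not in inputs]

# Filter first, then truncate: the result keeps the requested length
filter_then_truncate = [names[i] for i in ranked if names[i] not in inputs][:n]

print(truncate_then_filter)
print(filter_then_truncate)
```

Because an input drink usually matches itself with the highest score, filtering after truncation tends to return fewer than n names, while filtering before truncation does not.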

In [None]:
df.Beverage.unique()

It returns an array of unique values that appear in the 'Beverage' column. Each element of the array represents a unique drink name in the column.

In [None]:
inputs = []
num_inputs = int(input("How many drinks do you want to enter? "))
for i in range(num_inputs):
    drink = input("Enter a drink name: ")
    inputs.append(drink)

recommended_drinks = recommend_drinks(inputs)
print(recommended_drinks)

Asks the user how many drinks they want to enter and then for the name of each one. The recommend_drinks function takes those drink names as input, creates feature vectors for them using the TfidfVectorizer, calculates the cosine similarity between the input drinks and all drinks in the dataset, and returns the top recommended drinks based on their similarity to the input drinks. The result is printed as a list.
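
Calling int() directly on the typed text raises a ValueError if the user enters something that is not a whole number. One way to guard against that is sketched below; parse_drink_count is a hypothetical helper that is not part of this notebook, and falling back to a default of 1 is an assumption.

```python
def parse_drink_count(text, default=1):
    """Hypothetical helper: turn typed text into a positive int, or fall back."""
    try:
        count = int(text.strip())
    except ValueError:
        # Not a whole number, e.g. "two" or an empty string
        return default
    # Reject zero and negative counts as well
    return count if count > 0 else default


for typed in ["2", " 3 ", "two", "0"]:
    print(repr(typed), "->", parse_drink_count(typed))
```

In the cell above, the typed text from input() could be passed through such a helper before the loop that collects the drink names.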