# Mario Kart 8 Deluxe (+DLC) Online Voting
Mario Kart 8 Deluxe is the sixth best-selling game of all time[^1], making it the most popular entry in the Mario Kart franchise. With the addition of the *Mario Kart 8 Deluxe – Booster Course Pass*, the game boasts 96 courses, 48 characters, 41 vehicles, 22 tires, and 15 gliders. For players familiar with older installments of the franchise, that is an overwhelming amount of information.

After graduating college, I finally had time to play video games again. While I used to play League of Legends and CS:GO, those games require too much dedicated time and effort to master. Instead, I picked up my favorite game, Mario Kart 8 Deluxe.

It had been a few years since I had played, and I was delighted to find a 48-course expansion that included my favorite track growing up, DK Summit. As I played through the new cups, though, I felt overwhelmed by the number of courses I was unfamiliar with.


[^1]: Wikipedia, "List of best-selling video games."

# Do I need to know all 96 courses to play Mario Kart 8 Deluxe (+DLC) online?
**Short answer:** Yes.

**Long answer:** Yes, but you should focus on the most popular courses first.

## Are online courses truly random?
TODO: 
 - Talk about randomness and UX.
 - Mention how the iPod creates artificial randomness.
 - Mention how it *felt* like online courses weren't random and certain courses show up more than others.
 - Explain how confirmation bias can make this *feel* true, but we need to measure to know for sure.
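
To make the idea of "artificial randomness" concrete, here is a minimal sketch contrasting independent uniform draws with a shuffle bag, where every item is dealt once before any item repeats. The shuffle bag is only one common way to make shuffling *feel* fairer; it is an illustrative assumption, not a description of how Apple or Nintendo actually implement anything. The function names are hypothetical.

```python
import random


def uniform_draws(items, num_draws, rng):
  """Independent draws: repeats and long gaps are both possible."""
  return [rng.choice(items) for _ in range(num_draws)]


def shuffle_bag_draws(items, num_draws, rng):
  """Shuffle-bag draws: every item is dealt once before the bag refills."""
  draws, bag = [], []
  while len(draws) < num_draws:
    if not bag:
      # Refill and reshuffle once the bag is empty
      bag = list(items)
      rng.shuffle(bag)
    draws.append(bag.pop())
  return draws


rng = random.Random(0)
courses = list(range(1, 97))  # 96 courses, numbered 1 to 96

uniform = uniform_draws(courses, 96, rng)
bagged = shuffle_bag_draws(courses, 96, rng)

# Uniform draws usually miss some courses; one full bag never does.
print(len(set(uniform)), len(set(bagged)))
```

A shuffle bag guarantees full coverage after 96 draws, while independent draws typically leave some courses unseen, which is the kind of difference the measurements below look for.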

In [144]:
import numpy as np
import pandas as pd

SAMPLE_SIZE = 96  # The number of possible values in the draw. For Mario Kart 8 Deluxe, this is 96.
DRAWS = 313  # The number of times to draw a random value. This is the number of rows in the MK8DX Votes CSV.
NUM_SIMULATIONS = 1000 # The number of samples to take. This is the number of times to run the simulation.

def random_draw(max_value, size):
  """
  Draw a random integer from 1 to `max_value` (inclusive) `size` times.

  Parameters:
  max_value (int): The maximum value of the draw. If `max_value` is 10, the draw can be up to 10.
  size (int): The number of draws to make.

  Returns:
  np.ndarray: An array of `size` random integers from 1 to `max_value`.
  """

  rng = np.random.default_rng()
  # The upper bound of `integers` is exclusive, so add 1 to include `max_value`
  return rng.integers(1, max_value + 1, size=size)


def array_frequency_df(arr):
  """
  Count the frequency of each unique value in an array.

  Parameters:
  arr (list): A list of integers.

  Returns:
  pd.DataFrame: A DataFrame with two columns: `value` and `count`. The `value` column contains the unique values in `arr`, and the `count` column contains the frequency of each value.
  """

  value_counts = pd.Series(arr).value_counts()

  df = value_counts.reset_index()
  df.columns = ['value', 'count']
  df = df.set_index('value')

  return df


def insert_statistics_row(df, col):
  """
  Append a row containing the minimum, maximum, mean, median, and number of unique values of a Series to a DataFrame.

  Parameters:
  df (pd.DataFrame): The DataFrame of results collected so far.
  col (pd.Series): The counts to summarize, indexed by the drawn value.

  Returns:
  pd.DataFrame: A new DataFrame containing the rows of `df` followed by the statistics row.
  """

  # Calculate the statistics of the column
  min_value = col.min()
  max_value = col.max()
  mean_value = col.mean()
  median_value = col.median()
  unique_values = col.index.nunique()

  # New row to be appended to the DataFrame
  stats_row = pd.DataFrame({
    'min': [min_value],
    'max': [max_value],
    'mean': [mean_value],
    'median': [median_value],
    'unique_values': [unique_values]
  })

  # Drop any columns that are all NaN
  df = df.dropna(axis=1, how='all')
  rf = pd.concat([df, stats_row], ignore_index=True)

  return rf


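def simulate_draws(max_value, draws, samples):
  """
  Run `samples` simulations of `draws` uniform random draws from 1 to `max_value`.

  Returns:
  pd.DataFrame: One row of summary statistics per simulation, plus a `pct_coverage` column.
  """

  # Create an empty DataFrame to store the results
  rf = pd.DataFrame(columns=['min', 'max', 'mean', 'median', 'unique_values'])

  for _ in range(samples):
    picks = random_draw(max_value, draws)
    df = array_frequency_df(picks)
    # Note: values that were never drawn have no row in `df`, so the
    # statistics cover only values drawn at least once
    rf = insert_statistics_row(rf, df['count'])

  # Percentage of the possible values that appeared at least once
  rf['pct_coverage'] = rf['unique_values'] / max_value * 100

  return rf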

rf = simulate_draws(SAMPLE_SIZE, DRAWS, NUM_SIMULATIONS)
rf.to_csv('../data/mk8dx_random.csv', index=False)
rf

Unnamed: 0,min,max,mean,median,unique_values,pct_coverage
0,1,7,3.329787,3.0,94,97.916667
1,1,8,3.597701,3.0,87,90.625000
2,1,9,3.365591,3.0,93,96.875000
3,1,11,3.477778,3.0,90,93.750000
4,1,8,3.365591,3.0,93,96.875000
...,...,...,...,...,...,...
995,1,8,3.556818,3.0,88,91.666667
996,1,8,3.477778,3.0,90,93.750000
997,1,8,3.516854,3.0,89,92.708333
998,1,7,3.439560,3.0,91,94.791667


## Visualize what true randomness would look like in MK8DX online courses

In [141]:
import altair as alt

# Distribution of the average course frequency across simulations
dist_mean = alt.Chart(rf).mark_bar(size=2).encode(
  x=alt.X('mean:Q', scale=alt.Scale(domain=[0, 4])),
  y='count()',
  tooltip=['mean', 'count()']
)

# Distribution of the percentage of courses featured across simulations
dist_coverage = alt.Chart(rf).mark_bar(size=2).encode(
  x=alt.X('pct_coverage:Q', scale=alt.Scale(domain=[0, 100])),
  y='count()',
  tooltip=['pct_coverage', 'count()']
)

dist_mean | dist_coverage


In [143]:
# Assessing the coverage of the data set
print(rf['pct_coverage'].describe(), "\n\n========\n", rf['mean'].describe())

count    1000.000000
mean       96.215625
std         1.848029
min        90.625000
25%        94.791667
50%        96.875000
75%        97.916667
max       100.000000
Name: pct_coverage, dtype: float64 

 count    1000.000000
mean        3.389916
std         0.065688
min         3.260417
25%         3.329787
50%         3.365591
75%         3.439560
max         3.597701
Name: mean, dtype: float64


In [151]:
# Analyzing the distribution of the data set
ONLINE_DF = '../data/mk8dx_online.csv'
mk_df = pd.read_csv(ONLINE_DF)

def series_frequency_df(df_col):
  """
  Count the frequency of each unique value in a DataFrame column.

  Parameters:
  df_col (pd.Series): A single column of a DataFrame.

  Returns:
  pd.DataFrame: A DataFrame indexed by `value` with one column, `count`. The `value` index contains the unique values in `df_col`, and the `count` column contains the frequency of each value.
  """

  value_counts = df_col.value_counts()

  df = value_counts.reset_index()
  df.columns = ['value', 'count']
  df = df.set_index('value')

  return df

mk_votedcourse = series_frequency_df(mk_df['voted_course'])
mk_votedcourse.to_csv('../data/mk8dx_votedcourses.csv', index=True)

In [153]:
# Visualizing the distribution of the voted courses
df = pd.read_csv('../data/mk8dx_votedcourses.csv')

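def filter_outliers(df, filter_col):
  """
  Remove rows whose `filter_col` value lies more than 1.5 * IQR outside the quartiles.

  Parameters:
  df (pd.DataFrame): The DataFrame to filter.
  filter_col (str): The name of the numeric column to test for outliers.

  Returns:
  pd.DataFrame: The rows of `df` that fall within the lower and upper bounds.
  """

  # Calculate the quartiles and IQR
  Q1 = df[filter_col].quantile(0.25)
  Q3 = df[filter_col].quantile(0.75)
  IQR = Q3 - Q1

  # Calculate the lower and upper bounds
  lower_bound = Q1 - 1.5 * IQR
  upper_bound = Q3 + 1.5 * IQR

  # Filter the DataFrame on the same column used to compute the bounds
  rf = df[(df[filter_col] >= lower_bound) & (df[filter_col] <= upper_bound)]

  return rf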

df = filter_outliers(df, 'count')

alt.Chart(df).mark_bar(size=10).encode(
  x=alt.X('count:Q', scale=alt.Scale(domain=[0, 10])),
  y=alt.Y('count()'),
  tooltip=['count']
)

In [132]:
df.describe()

Unnamed: 0,count
count,93.0
mean,3.365591
std,1.798459
min,1.0
25%,2.0
50%,3.0
75%,5.0
max,8.0


In [133]:
column_sum = df['count'].sum()
print(column_sum)


313


# **Simulated v. Mario Kart Online Randomness**
| Statistic | Random | MK8DX Online |
| ----- | :-----: | :-----: |
| Courses featured (int) | 92 | 93 |
| Courses featured (%) | 96.22% | 96.88% |
| Course frequency average | 3.39 | 3.37 |
| Course frequency variance (std) | 0.07 | 1.80 |

## **Definitions**

### **Courses featured (int)**
**The total number of courses featured.**

For **Random**, this is the average across the 1000 simulations, rounded to the nearest integer.

### **Courses featured (%)**
**The percentage of MK8DX's 96 courses featured.**

For **MK8DX Online**, this is calculated by the following formula:

`[Courses featured (int)] / [Total courses] * 100 = [Courses featured (%)]`

This formula does not apply to **Random**, which instead reports the average *courses featured (%)* across the 1000 simulations.

### **Course frequency average**
**The average number of times a course is featured in the data.**

The **MK8DX** data excludes votes for *Random* because *Random* is always available as a voting option. This leaves 313 of the total 401 rows in the data set.

The **Random** data reflects 313 draws from a set of 96 courses simulated 1000 times. These parameters allow us to compare statistics between simulated and real-world data.

### **Course frequency variance (std)**
**The standard deviation of course frequency.**

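For **MK8DX Online**, this is the standard deviation of the per-course counts in our sample. For **Random**, the 0.07 value is the standard deviation of the *average* course frequency across the 1000 simulations. The two figures therefore measure different things and should not be compared directly.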
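
A per-run standard deviation of course counts, which is the quantity reported for MK8DX Online, could be simulated along these lines. This is a sketch under stated assumptions rather than part of the original analysis: `per_course_std` is a hypothetical helper, the seed is arbitrary, and courses are numbered from 0 purely for convenience with `np.bincount`.

```python
import numpy as np

NUM_COURSES = 96   # Courses in MK8DX with the Booster Course Pass
NUM_DRAWS = 313    # Votes left after removing *Random*
NUM_RUNS = 1000    # Simulated runs


def per_course_std(num_courses, num_draws, rng, include_zeros=False):
  """Standard deviation of per-course counts within one simulated run."""
  picks = rng.integers(0, num_courses, size=num_draws)
  counts = np.bincount(picks, minlength=num_courses)
  if not include_zeros:
    # Mirror the notebook, which only counts courses drawn at least once
    counts = counts[counts > 0]
  # ddof=1 gives the sample standard deviation, as pandas reports it
  return counts.std(ddof=1)


rng = np.random.default_rng(seed=0)
stds = np.array([
  per_course_std(NUM_COURSES, NUM_DRAWS, rng) for _ in range(NUM_RUNS)
])

print(stds.mean(), stds.min(), stds.max())
```

For reference, a single course's count under uniform draws is Binomial(313, 1/96), whose standard deviation is `sqrt(313 * (1/96) * (95/96))`, or about 1.80.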

## **Analysis**
Our research question asked if online course voting in Mario Kart 8 Deluxe +DLC (MK8DX) is truly random. We hypothesized that Nintendo, like Apple's iPod shuffle feature, uses artificial randomness because it improves the user experience. We also hypothesized that qualitative observation was an unreliable assessment of randomness due to recency and confirmation bias.

To measure MK8DX's randomness, we collected a random sample of course votes across 105 online races and built a dataset recording which courses were featured. To compare our sample to simulated randomness, we wrote a random draw program that lets us specify the sample size, number of draws, and number of simulations.

To set the parameters for our random draw simulator, we first needed to clean our MK8DX data. We created a new dataframe with two columns: *value* and *count*. *Value* was the course name, and *count* was the number of times it appeared in the `voted_course` column. This dataframe contained an outlier for the value *Random* because players can always vote for a random course. Removing this outlier reduced the total in the *count* column from 401 to 313.

Since the cleaned dataset featured 313 random draws, we were finally able to generate a comparable random dataset. The simulation used a sample size of 96 to reflect the number of MK8DX courses, 313 random draws, and ran 1000 simulations. By running 1000 simulations, we are able to assess the probability that MK8DX's course draws are random.
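
One simple way to place an observed statistic within the simulated distribution is an empirical percentile: the share of simulated values at or below the observed one. The sketch below is illustrative only. `empirical_percentile` is a hypothetical helper, and the toy numbers are made up rather than taken from our data.

```python
import numpy as np


def empirical_percentile(simulated, observed):
  """Percentage of simulated values that are <= the observed value."""
  simulated = np.asarray(simulated, dtype=float)
  return float((simulated <= observed).mean() * 100)


# Toy illustration with made-up numbers, not the study's data
toy_simulated = [90, 91, 92, 93, 94]
toy_percentile = empirical_percentile(toy_simulated, 92)
print(toy_percentile)  # 3 of the 5 values are <= 92, so this prints 60.0
```

With the notebook's results, the same helper could be applied as `empirical_percentile(rf['unique_values'], 93)`. A percentile near 0 or 100 would suggest the observed value is unusual under uniform randomness.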

The *random* data featured an average of 92 of 96 online courses across 313 draws. Our *MK8DX* data featured 93 of 96 online courses across 313 draws. This equated to 96.22% and 96.88% coverage respectively. The real-world and simulated data are nearly identical by this measure.
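
The simulated coverage can also be cross-checked analytically. Under uniform draws, a given course is missed in all 313 draws with probability `(95/96)^313`, so the expected number of distinct courses is `96 * (1 - (95/96)^313)`. The sketch below is an added sanity check, not part of the original notebook, and `expected_unique` is a hypothetical helper name.

```python
def expected_unique(num_courses, num_draws):
  """Expected number of distinct courses seen in `num_draws` uniform draws."""
  # Probability that one particular course is never drawn
  p_never_drawn = (1 - 1 / num_courses) ** num_draws
  return num_courses * (1 - p_never_drawn)


expected = expected_unique(96, 313)
expected_pct = expected / 96 * 100
print(round(expected, 2), round(expected_pct, 2))
```

This works out to about 92.4 courses, or roughly 96.2% coverage, in line with the simulated average.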

When looking at the average number of times a course is featured, the *random* data averaged a frequency of 3.39 and the *MK8DX* data averaged 3.37. More specifically, the median of the simulated averages was 3.37, which puts our *MK8DX* data right at the 50th percentile. Remarkably unremarkable if you ask me!

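Also of note, the minimum simulated average frequency was 3.260417. Because the average is taken only over courses that were drawn at least once, it equals the number of draws divided by the number of courses featured, and it reaches its minimum when all 96 courses appear:

`[Number of draws] / [Sample size] = [Minimum average frequency]`

`313 / 96 = 3.260417`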

In conclusion, we can reject our hypothesis that MK8DX uses artificial randomness for online courses, as our sample data fit well within the range of the simulated data. Considering Nintendo's reputation for user experience and how artificial randomness is used in music streaming, these results are surprising. However, given how difficult it would be to implement artificial randomness for up to 12 online players at once, it makes sense that Nintendo would prioritize more visible features instead.

This research began with a simple question: Do I need to know all 96 MK8DX tracks to play online?

In short, **yes**.

**More specifically**, MK8DX has players vote on one of three randomly selected courses. Players can also vote for a randomly selected course. After players vote, a course is picked at random. If the picked course is a vote for *random*, then a random track is selected. To know if a player needs to know all 96 tracks when playing online, we need to answer the following questions:
  1. Are all 96 courses featured in online play?
  1. Are the courses players vote for random?
  1. Does the number of votes increase a course's selection probability?
  1. Which courses are the most popular?
  1. Should course popularity impact which courses you learn first?

In answering these questions, we can determine which tracks are worth learning in *Time Trial* mode and which aren't worthwhile. We can also begin to analyze trends in player voting bias. This includes answering the following questions:
  1. Does nostalgia impact course popularity?
  1. How might course difficulty affect course popularity?
  1. Is there a clear hierarchy in course popularity?
  1. Based on our data, is there a statistical *most popular* MK8DX course?
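
The voting process described above (players vote for one of three offered courses or for *Random*, and one vote is then picked at random) can be sketched as follows. This is a rough model under stated assumptions, not Nintendo's actual implementation: it assumes the winning vote is drawn uniformly from the votes cast and that a *Random* vote resolves to any of the 96 courses with equal probability. `play_round` and the example votes are hypothetical.

```python
import random
from collections import Counter

NUM_COURSES = 96
RANDOM_VOTE = "Random"


def play_round(votes, rng, num_courses=NUM_COURSES):
  """Pick one vote uniformly at random and resolve it to a course number."""
  picked = rng.choice(votes)
  if picked == RANDOM_VOTE:
    # Assumption: a Random vote can land on any of the courses
    return rng.randrange(1, num_courses + 1)
  return picked


rng = random.Random(0)
# Six players: three vote for course 5, one each for 17, 42, and Random
votes = [5, 5, 5, 17, 42, RANDOM_VOTE]

results = Counter(play_round(votes, rng) for _ in range(10_000))
share_course_5 = results[5] / 10_000
print(results.most_common(3), share_course_5)
```

Under these assumptions, a course with half of the votes is selected about half of the time, which is one way to frame the question of whether the number of votes increases a course's selection probability.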


In [None]:
df = pd.read_csv('../data/mk8dx_online.csv')

