# Explore the Candy Crush dataset

The dataset contains one week of data from a sample of players who played Candy Crush back in 2014. The data is also from a single episode, that is, a set of 15 levels. It has the following columns:

* player_id: a unique player id
* dt: the date
* level: the level number within the episode, from 1 to 15.
* num_attempts: number of level attempts for the player on that level and date.
* num_success: number of level attempts that resulted in a success/win for the player on that level and date.

The granularity of the dataset is player, date, and level. That is, there is a row for every player, day, and level recording the total number of attempts and how many of those resulted in a win.

In [1]:
# Load the dataset and count the number of rows and columns
import pandas as pd

# Load 'candy_crush.csv' into a DataFrame
df = pd.read_csv('../Datasets/candy_crush.csv')

# Display the number of rows and columns
print(df.shape)

In [3]:
# Count and display the number of unique players
unique_count = df['player_id'].nunique()
print(unique_count)

# Display the date range of the data.
# 'dt' holds ISO-formatted dates (YYYY-MM-DD), so min() and max() return the
# earliest and latest date even when the column is stored as strings.
print(df['dt'].min())
print(df['dt'].max())

6814
2014-01-01
2014-01-07


# Computing level difficulty
Within each Candy Crush episode, there is a mix of easier and tougher levels. Because of luck and individual skill, the number of attempts required to pass a level differs from player to player. The assumption is that difficult levels require more attempts on average than easier ones. That is, the harder a level is, the lower the probability of passing it in a single attempt.

A simple way to model this is as a Bernoulli process: each attempt is a binary outcome (you either win or lose) characterized by a single parameter $p_{win}$, the probability of winning the level in a single attempt. This probability can be estimated for each level as:

$p_{win} = \frac{\sum \text{wins}}{\sum \text{attempts}}$

For example, let's say a level has been played 10 times and 2 of those attempts ended in a victory. Then the estimated probability of winning in a single attempt would be $p_{win} = 2 / 10 = 20\%$.
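
The estimate above can be sketched on a tiny table with the same columns as the dataset. The rows below are made up for illustration (they are not taken from the real data), and the names `toy`, `totals`, and `toy_pwin` are hypothetical. Level 1 reproduces the 2-out-of-10 example.

```python
import pandas as pd

# Hypothetical toy data in the same shape as the dataset (values are made up)
toy = pd.DataFrame({
    'player_id': ['a', 'a', 'b', 'b'],
    'dt': ['2014-01-01', '2014-01-02', '2014-01-01', '2014-01-01'],
    'level': [1, 1, 1, 2],
    'num_attempts': [4, 3, 3, 4],
    'num_success': [1, 0, 1, 2],
})

# Sum wins and attempts per level, then divide the totals
totals = toy.groupby('level')[['num_attempts', 'num_success']].sum()
toy_pwin = totals['num_success'] / totals['num_attempts']

print(toy_pwin)
```

Note that the totals are summed before dividing. Averaging per-row win rates instead would give every row the same weight regardless of how many attempts it records.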

Now, let's compute the difficulty $p_{win}$ separately for each of the 15 levels.

In [9]:
import pandas as pd

# The DataFrame has columns: 'level', 'num_attempts', 'num_success'
min_level = int(df['level'].min())
max_level = int(df['level'].max())

levels = []
pwin = []

# Iterate over each level from min_level to max_level
for level in range(min_level, max_level + 1):
    # Filter the DataFrame for the current level
    level_df = df.loc[df['level'] == level]

    # Each row already holds counts per player and date,
    # so sum them to get the totals for the level
    total_attempts = level_df['num_attempts'].sum()
    successes = level_df['num_success'].sum()

    # Calculate the ratio of successes to attempts (pwin)
    if total_attempts > 0:
        ratio = successes / total_attempts
    else:
        ratio = 0  # To handle levels with zero attempts

    # Store the level and pwin ratio
    levels.append(level)
    pwin.append(ratio)

# Create a DataFrame with the results
pwn_df = pd.DataFrame({'level': levels, 'pwin': pwin})

# Display the result
print(pwn_df)

In [None]:
# plot the difficulty profile for all levels
# choose the most appropriate type of visualization
from matplotlib import pyplot as plt
...

# Add a horizontal dashed line at y=0.1
plt.axhline(y=0.1, color='red', linestyle='--')
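
One possible way to draw the difficulty profile is a line plot with one marker per level. The sketch below is only an illustration: the values in `example_pwin` are made up rather than computed from the dataset, and the names `example_levels`, `example_pwin`, `fig`, and `ax` are hypothetical.

```python
from matplotlib import pyplot as plt

# Hypothetical pwin values for illustration only (not results from the dataset)
example_levels = [1, 2, 3, 4, 5]
example_pwin = [0.6, 0.4, 0.1, 0.3, 0.05]

fig, ax = plt.subplots(figsize=(8, 4))

# One point per level, connected to show how difficulty changes across the episode
ax.plot(example_levels, example_pwin, marker='o')

# Horizontal dashed reference line at y=0.1
ax.axhline(y=0.1, color='red', linestyle='--')

ax.set_xticks(example_levels)
ax.set_xlabel('level')
ax.set_ylabel('pwin')
ax.set_title('Difficulty profile')
```

A line plot is chosen here because levels are ordered. A bar chart would also work if the levels are better read as separate categories.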

# Compute Uncertainty
As data scientists, we should always report some measure of uncertainty for the numbers we provide. Maybe tomorrow, another sample will give us slightly different values for the difficulties! Here we will simply use the standard error as a measure of uncertainty:

$S_{error} = \frac{S_{sample}}{\sqrt{n}}$

Here $n$ is the number of data points and $S_{sample}$ is the sample standard deviation. For a Bernoulli process, the sample standard deviation is:

$S_{sample} = \sqrt{p_{win}(1-p_{win})}$

Therefore, we can calculate the standard error like this:

$S_{error} = \sqrt{\frac{p_{win}(1-p_{win})}{n}}$

We already have all we need: for each level, $n$ is the total number of attempts (the sum of num_attempts) and $p_{win}$ is its estimated difficulty. Now, let's calculate the standard error for each level.

In [None]:
# Compute the standard error of p_win for each level

# Create a barplot with standard errors
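
A minimal sketch of the standard error calculation, assuming a data frame that holds the total attempts and wins per level. The numbers below are made up for illustration and the names `se_df` and `error` are hypothetical. They are not results from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical per-level totals (made-up numbers, not results from the dataset)
se_df = pd.DataFrame({
    'level': [1, 2, 3],
    'num_attempts': [100, 400, 2500],
    'num_success': [50, 80, 250],
})
se_df['pwin'] = se_df['num_success'] / se_df['num_attempts']

# Standard error of a Bernoulli proportion: sqrt(p * (1 - p) / n)
se_df['error'] = np.sqrt(se_df['pwin'] * (1 - se_df['pwin']) / se_df['num_attempts'])

print(se_df)
```

For the bar plot, matplotlib's `bar` accepts a `yerr` argument, so the `error` column can be passed there to draw error bars on top of the `pwin` bars.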

# Estimating probabilities of winning
One question a level designer might ask is: *How likely is it that a player will reach a level without losing a single time?* 

Let's calculate this using the estimated level difficulties!

Recall that the probability of two independent events both happening is simply the product of the individual probabilities.
So the probability of winning both level 1 and level 2 on the first attempt would be:
$p_{win}[1] \cdot p_{win}[2]$

To extend this to all levels up to level $Y$, multiply all the probabilities in the vector together, for example with `np.prod`.

In [None]:
# Create a function that, given a level, calculates the probability of reaching that level without losing.
import numpy as np
# Probability of winning levels 1 through 5 on the first attempt
np.prod(pwin[0:5])

# develop a function that allows you to calculate the likelihood up to level $n$
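
One way such a function could look is sketched below. The helper name `prob_no_loss` and the values in `example_pwin` are hypothetical. The sketch assumes the probabilities are ordered by level, starting at level 1, and that attempts on different levels are independent.

```python
import numpy as np

def prob_no_loss(pwin, n):
    """Probability of winning levels 1..n on the first attempt each.

    Hypothetical helper: assumes pwin is ordered by level, starting at
    level 1, and that the levels are independent of each other.
    """
    if not 0 <= n <= len(pwin):
        raise ValueError("n must be between 0 and the number of levels")
    # Product of the first n win probabilities (an empty product is 1.0)
    return float(np.prod(pwin[:n]))

# Made-up probabilities for illustration only
example_pwin = [0.5, 0.4, 0.5]
print(prob_no_loss(example_pwin, 2))
```

If the probability is needed for every level at once, `np.cumprod` returns the running product for all prefixes in a single call.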