Q7 - Statistician's Nightmare

Question: Welcome to the Statistician's Nightmare!
You are given a dataset of various magical creatures and their bizarre daily activities.
The data is filled with quirky statistics and unusual measures.
Your task is to perform statistical analysis to answer the following questions:

- Calculate the mean, median, and mode of the hours each creature spends on different activities.
- Identify the creature with the highest variance in activity hours.
- Determine the correlation between the number of magical spells cast and the hours spent on activities.
- Find outliers in the dataset based on activity hours.
- Perform a hypothesis test to determine if the average hours spent by creatures on "Flying" is different from "Potion Making."

Datasets:

creature_activities: Contains columns (creature_id, creature_name, activity, hours, spells_cast).

In [None]:
import pandas as pd
import numpy as np
from scipy import stats

# Seed for reproducibility
np.random.seed(303)

# Generate synthetic data
creature_ids = np.arange(1, 21)
creature_names = ['Frodo Frog', 'Gimli Gnome', 'Luna Leprechaun', 'Percy Pixie', 'Trevor Troll']
activities = ['Flying', 'Potion Making', 'Spell Casting', 'Herb Gathering', 'Treasure Hunting']
hours_options = np.arange(0, 25)
spells_cast_options = np.arange(0, 101)

data = []
for creature in creature_ids:
    creature_name = np.random.choice(creature_names)
    for activity in activities:
        hours = np.random.choice(hours_options)
        spells_cast = np.random.choice(spells_cast_options)
        data.append([creature, creature_name, activity, hours, spells_cast])

# Create DataFrame
creature_activities = pd.DataFrame(data, columns=['creature_id', 'creature_name', 'activity', 'hours', 'spells_cast'])

# Display the dataset
creature_activities.head()

In [None]:
# Calculate the mean, median, and mode of hours for each activity, pooled across all creatures.
time_by_activity_mean = creature_activities.groupby(['activity'])['hours'].mean().reset_index()
time_by_activity_mean

In [None]:
time_by_activity_median = creature_activities.groupby(['activity'])['hours'].median().reset_index()
time_by_activity_median

In [None]:
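# Series.mode() avoids relying on the return shape of scipy.stats.mode, which differs across SciPy versions; ties resolve to the smallest value.
time_by_activity_mode = creature_activities.groupby(['activity'])['hours'].agg(lambda x: x.mode().iloc[0]).reset_index()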
time_by_activity_mode
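
The three statistics can also be collected in a single table with named aggregations. This is a minimal sketch on a small hypothetical frame `toy` (not the generated dataset) that only assumes the same `activity` and `hours` columns; the mode is taken with pandas' `Series.mode`.

```python
import pandas as pd

# Hypothetical toy data with the same column names as creature_activities
toy = pd.DataFrame({
    "activity": ["Flying", "Flying", "Flying",
                 "Potion Making", "Potion Making", "Potion Making"],
    "hours": [2, 2, 8, 5, 7, 7],
})

# One groupby, three named output columns
summary = (
    toy.groupby("activity")["hours"]
    .agg(mean="mean", median="median", mode=lambda s: s.mode().iloc[0])
    .reset_index()
)
print(summary)
```

Swapping `toy` for `creature_activities` should give the same numbers as the three separate cells, just side by side.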

In [None]:
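# Identify the creature with the highest variance in activity hours.
# Group by creature_id as well as creature_name: names are drawn at random per ID,
# so several creatures can share a name, and grouping by name alone would pool them.
variance_hours = creature_activities.groupby(['creature_id', 'creature_name'])['hours'].var().reset_index()
highest_variance_creature = variance_hours.loc[variance_hours['hours'].idxmax()]
highest_variance_creature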
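
One detail worth keeping in mind when reading the variance: pandas' `.var()` computes the sample variance (`ddof=1`), while NumPy's `np.var` defaults to the population variance (`ddof=0`). The sketch below uses made-up hours for a single hypothetical creature to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical hours for one creature across three activities
hours = pd.Series([2, 4, 6])

sample_var = hours.var()                    # pandas default: ddof=1 (divide by n - 1)
population_var = np.var(hours.to_numpy())   # NumPy default: ddof=0 (divide by n)

print(f"sample variance: {sample_var}, population variance: {population_var:.4f}")
```

With only five activity rows per creature, the two definitions differ noticeably, although the ranking of creatures is the same under either one when every group has the same size.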

In [None]:
# Determine the correlation between the number of magical spells cast and the hours spent on activities
correlation = creature_activities[['hours', 'spells_cast']].corr().iloc[0, 1]
correlation
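
`DataFrame.corr()` returns only the Pearson coefficient. If a p-value is also wanted, `scipy.stats.pearsonr` reports both. The following is a sketch on hypothetical values, not the generated dataset.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data with the same column names as creature_activities
toy = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "spells_cast": [10, 8, 12, 9, 15],
})

# pearsonr returns the coefficient and a two-sided p-value
r, p_value = stats.pearsonr(toy["hours"], toy["spells_cast"])
print(f"Pearson r: {r:.3f}, p-value: {p_value:.3f}")
```

Because `hours` and `spells_cast` are drawn independently in the data-generation cell, a coefficient close to zero would be the expected outcome on the real dataset.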

In [None]:
# Find outliers in the dataset based on activity hours using the IQR method
Q1 = creature_activities['hours'].quantile(0.25)
Q3 = creature_activities['hours'].quantile(0.75)
IQR = Q3 - Q1
outliers = creature_activities[(creature_activities['hours'] < (Q1 - 1.5*IQR)) | (creature_activities['hours'] > (Q3 + 1.5*IQR))]
outliers
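
The IQR rule can be wrapped in a small helper so the multiplier is easy to change. `iqr_outliers` and its parameter `k` are hypothetical names introduced for this sketch, and the data is made up so that one value clearly falls outside the fences.

```python
import pandas as pd

def iqr_outliers(df, column, k=1.5):
    """Return rows whose `column` value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    mask = (df[column] < q1 - k * iqr) | (df[column] > q3 + k * iqr)
    return df[mask]

# Hypothetical toy data: five similar values and one extreme value
toy = pd.DataFrame({
    "creature_id": [1, 2, 3, 4, 5, 6],
    "hours": [4, 5, 5, 6, 6, 24],
})

flagged = iqr_outliers(toy, "hours")
print(flagged)
```

On the generated dataset, `hours` is drawn uniformly from 0 to 24, so the fences will usually extend beyond that range and the result may well be empty.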

In [None]:
# Perform a hypothesis test to determine if the average hours spent by creatures on "Flying" is different from "Potion Making"
flying_hours = creature_activities[creature_activities['activity'] == 'Flying']['hours']
potion_hours = creature_activities[creature_activities['activity'] == 'Potion Making']['hours']
t_stat, p_val = stats.ttest_ind(flying_hours, potion_hours)
print(f"T-Statistic: {t_stat}, P-Value: {p_val}")
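
`stats.ttest_ind` pools the two variances unless `equal_var=False` is passed, which gives Welch's t-test. Since every creature has one row for each activity, a paired test (`stats.ttest_rel`) is another reasonable choice. The sketch below runs both on hypothetical hours; the significance level `alpha = 0.05` is an assumption, not something the exercise specifies.

```python
from scipy import stats

# Hypothetical hours for five creatures; position i is the same creature in both lists
flying = [5, 7, 9, 11, 13]
potion = [4, 8, 8, 12, 10]

# Welch's t-test: independent samples, no equal-variance assumption
welch_t, welch_p = stats.ttest_ind(flying, potion, equal_var=False)

# Paired t-test: works on the per-creature differences
paired_t, paired_p = stats.ttest_rel(flying, potion)

alpha = 0.05  # assumed significance level
for name, p in [("Welch", welch_p), ("Paired", paired_p)]:
    decision = "reject" if p < alpha else "fail to reject"
    print(f"{name}: p = {p:.3f} -> {decision} H0 of equal means")
```

To apply the paired version to the real data, the two series would need to be aligned by `creature_id` first, for example by pivoting `activity` into columns.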