<a href="https://colab.research.google.com/github/brendanpshea/logic-prolog/blob/main/StatisticsWithWerewolves.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Statistics With Werewolves
### A Little More Logical | Brendan Shea, PhD
**Statistics** is the science of data. It involves the processes of collecting, analyzing, interpreting, and presenting numerical information. This field is essential for making informed decisions based on data, allowing us to extract meaningful patterns and conclusions from a sea of numbers. By using statistics, we can transform raw data into useful insights, providing the ability to forecast trends, test hypotheses, and understand the world in a more data-driven way. In this chapter, statistics will be our tool to unravel the mysteries hidden in the High School Werewolf Dataset.

Our journey through statistics is centered around the High School Werewolf Dataset. In this (fictional) high school of 2,000 students, a number are secretly werewolves. This dataset includes a variety of information, from physical characteristics like height and eye color to academic and behavioral data like GPA and detentions. It's an ideal setting to explore statistical concepts because it combines everyday school life with the extraordinary element of werewolves. The dataset provides a rich, multifaceted context for learning how to analyze and interpret various types of data.

In statistical terms, the entire student body of 2,000 represents our **population**. The population is the complete set of data that we are interested in studying. However, studying a whole population is often impractical, so statisticians use a **sample**: a smaller, manageable part of the population that is (or is thought to be) representative of the whole. In our case, the sample is a 20-student 11th-grade class.

Throughout the chapter, we'll explore various questions that blend statistical analysis with the intriguing theme of werewolves. For example, we might ask if werewolf students have a different average height than their human counterparts, or if there are significant differences in full moon absences between the two groups. We'll also investigate whether eye color can predict werewolf status and explore if detention patterns suggest nocturnal activities typical of werewolves. These questions, while set in a fictional context, are designed to provide real-world insights into how statistics can be used to uncover hidden patterns and tell compelling stories with data.

### Getting to Know Our Data
In this section, we'll introduce you to two powerful tools that will help us explore the High School Werewolf Dataset: Google Colab and Pandas. Don't worry if you're new to programming or feel a bit apprehensive about it; these tools are user-friendly and designed to make your journey into data analysis both enjoyable and insightful.

**Google Colab** is an online platform that allows you to write and execute Python code through your browser. Think of it as a digital notebook that not only lets you write notes but also run code, all in one place. The best part? You don't need to install any software on your computer. All you need is an internet connection and a Google account.

Colab notebooks are interactive, meaning you can write a piece of code, run it, and see the results immediately. This makes it an excellent tool for learning, experimentation, and working with data, especially for beginners. You'll be using Google Colab to access, analyze, and visualize the werewolf dataset, learning hands-on how to manipulate and understand data.

**Pandas** is a library for Python (a programming language) that is designed specifically for data manipulation and analysis. Think of it as a powerful calculator that can not only perform basic arithmetic but also organize and analyze large amounts of data with ease.

With Pandas, we can read our High School Werewolf Dataset, which is stored in a CSV (Comma-Separated Values) file, a common format for storing tabular data. The dataset is available at this link: [High School Werewolf Data](https://github.com/brendanpshea/logic-prolog/raw/main/high_school_werewolf_data.csv). Pandas allows us to load this dataset into a DataFrame – a table that's very similar to a spreadsheet you might see in Excel. Once the data is in a DataFrame, we can perform various operations like calculating averages, filtering data based on conditions (like finding all werewolf students), and creating visualizations.
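
To make the idea of a DataFrame concrete before touching the real file, here is a minimal sketch. The four rows in `toy_df` are hypothetical values invented for illustration, not taken from the dataset; only the column names `GPA` and `IsWerewolf` match the real data.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration (not from the real CSV).
toy_df = pd.DataFrame({
    "GPA": [3.0, 3.5, 2.5, 4.0],
    "IsWerewolf": [False, True, False, True],
})

# Calculating an average: the mean of a single column.
mean_gpa = toy_df["GPA"].mean()

# Filtering on a condition: keep only the rows where IsWerewolf is True.
werewolves = toy_df[toy_df["IsWerewolf"]]

print(mean_gpa)         # average GPA across all four toy students
print(len(werewolves))  # number of toy students flagged as werewolves
```

The same two patterns, selecting a column and filtering rows with a condition, are the ones we will apply to the full dataset.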

When we combine Google Colab and Pandas, we have a powerful setup for working with our dataset. You'll be able to write and run Python code in Colab to manipulate and analyze the dataset using Pandas, even if you've never written a line of code before.

So, let's start this adventure! We'll begin by opening our dataset in Google Colab and taking our first look at the data using Pandas. As we go through different statistical concepts, you'll see how these tools make data analysis accessible and engaging.


### Step 1: Loading the Dataset into a DataFrame

First, we need to load our dataset into a structure called a DataFrame, which Pandas uses to store and manipulate data in a tabular form. Here's how you can do it:

```python
import pandas as pd

url = 'https://github.com/brendanpshea/logic-prolog/raw/main/high_school_werewolf_data.csv'
school_df = pd.read_csv(url)
```

You can run this cell by pressing the 'Play' button or by using the shortcut Shift + Enter. This will execute the code, import Pandas, and load the dataset into a DataFrame named `school_df`.

### Step 2: Viewing the First Few Rows of the Dataset

Once the dataset is loaded, it's good practice to view the first few rows. This helps us get an initial feel for the data – the columns, the type of values, and so on. You can do this by calling the `head()` method on the DataFrame. In a new cell, we can type this:

```python
school_df.head()
```

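| Unnamed: 0 | Sex | Height | EyeColor | FullMoonAbsence | GPA | WerewolfParents | Detentions | IsWerewolf | Homeroom |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | 71.79 | Blue | 5 | 2.82 | 0 | 1 | True | 10-A |
| 1 | Female | 62.05 | Blue | 1 | 3.67 | 0 | 1 | False | 10-A |
| 2 | Male | 70.14 | Brown | 1 | 3.79 | 0 | 0 | False | 10-A |
| 3 | Male | 70.83 | Green | 0 | 2.62 | 0 | 0 | False | 10-A |
| 4 | Male | 70.68 | Brown | 2 | 2.54 | 0 | 6 | False | 10-A |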


Running this command will display the first five rows of our dataset.
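
You may have noticed the column named `Unnamed: 0`. Pandas gives that name to a column whose header is blank, which usually happens when a table was saved together with its row numbers. The sketch below assumes that is what happened here: it pastes the five rows shown above into a string (so it runs without an internet connection) and compares the default load with one that uses the `index_col=0` option of `read_csv`.

```python
import io
import pandas as pd

# The five rows shown above, pasted inline. The blank first header is an
# assumption about how the real file stores its row numbers.
csv_text = """,Sex,Height,EyeColor,FullMoonAbsence,GPA,WerewolfParents,Detentions,IsWerewolf,Homeroom
0,Male,71.79,Blue,5,2.82,0,1,True,10-A
1,Female,62.05,Blue,1,3.67,0,1,False,10-A
2,Male,70.14,Brown,1,3.79,0,0,False,10-A
3,Male,70.83,Green,0,2.62,0,0,False,10-A
4,Male,70.68,Brown,2,2.54,0,6,False,10-A
"""

# Default load: the blank header becomes a regular column called "Unnamed: 0".
default_df = pd.read_csv(io.StringIO(csv_text))

# Treat the first column as row labels instead of as data.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_df.shape)          # (rows, columns) with the extra column
print(indexed_df.shape)          # one column fewer
print(list(indexed_df.columns))  # only the real variables remain
```

Either way of loading works for the rest of this chapter; the second simply keeps the row numbers out of the list of variables.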

### Step 3: Creating a Subset of the Data for Homeroom 11-B

Now, let's focus on a specific sample from our dataset – the students in Homeroom "11-B". We'll create a new DataFrame that contains only the data for these students. Here's how you can filter the data, and show the head:

```python
class_df = school_df[school_df['Homeroom'] == '11-B']
class_df.head()
```

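| Unnamed: 0 | Sex | Height | EyeColor | FullMoonAbsence | GPA | WerewolfParents | Detentions | IsWerewolf | Homeroom |
|---|---|---|---|---|---|---|---|---|---|
| 100 | Male | 76.58 | Blue | 1 | 2.11 | 1 | 1 | False | 11-B |
| 101 | Female | 57.64 | Grey | 0 | 2.93 | 0 | 8 | False | 11-B |
| 102 | Female | 63.87 | Green | 1 | 3.69 | 0 | 0 | True | 11-B |
| 103 | Female | 65.37 | Blue | 0 | 3.0 | 0 | 1 | False | 11-B |
| 104 | Female | 55.76 | Green | 1 | 4.0 | 0 | 0 | False | 11-B |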


This code filters `school_df` to include only the rows where the 'Homeroom' column is '11-B', and stores this subset of data in a new DataFrame called `class_df`. When you call `head()` on `class_df`, you'll see the first few entries of data for Homeroom 11-B.

Congratulations! You've just loaded your first dataset into Pandas and created a subset of data. These initial steps are crucial in data analysis, as they set the foundation for all the exciting statistical explorations we're about to undertake with our High School Werewolf Dataset.
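
If you are curious how the filter in Step 3 works, it can be split into two smaller steps. The sketch below uses six rows copied from the two tables above (three from each homeroom) and, as an assumption, loads the row numbers as the index. The names `excerpt_df`, `is_11b`, and `class_excerpt` are made up for this illustration.

```python
import io
import pandas as pd

# Six rows copied from the outputs above; the blank first header is assumed.
csv_text = """,Sex,Height,EyeColor,FullMoonAbsence,GPA,WerewolfParents,Detentions,IsWerewolf,Homeroom
0,Male,71.79,Blue,5,2.82,0,1,True,10-A
1,Female,62.05,Blue,1,3.67,0,1,False,10-A
2,Male,70.14,Brown,1,3.79,0,0,False,10-A
100,Male,76.58,Blue,1,2.11,1,1,False,11-B
101,Female,57.64,Grey,0,2.93,0,8,False,11-B
102,Female,63.87,Green,1,3.69,0,0,True,11-B
"""
excerpt_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Step A: the comparison produces one True/False value per row.
is_11b = excerpt_df["Homeroom"] == "11-B"

# Step B: indexing with that True/False column keeps only the True rows.
class_excerpt = excerpt_df[is_11b]

print(is_11b.sum())                       # how many rows matched
print(list(class_excerpt.index))          # original row labels are preserved
print(class_excerpt["IsWerewolf"].sum())  # werewolves among the matched rows
```

Notice that the filtered rows keep their original labels (100, 101, 102) rather than being renumbered from 0, which is why the Homeroom 11-B table starts at 100.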

## Data Dictionary for High School Werewolf Dataset
A **data dictionary** is a document that explains the variables in a dataset. The dataset simulates a 2,000-student high school with a twist: some students are werewolves. The dataset is designed for educational purposes, allowing students and teachers to explore statistical concepts in a fun, engaging manner.

-   **Sex:** A categorical variable indicating the gender of the student. Possible values are 'Male' and 'Female'.

-   **Height:** A continuous variable representing the student's height in inches. Heights follow a normal distribution. On average, male students are taller than female students, and werewolf students tend to be taller than their non-werewolf peers.

-   **EyeColor:** A categorical variable indicating the eye color of the student. Possible values are 'Brown', 'Blue', 'Green', 'Grey', and 'Yellow'. Yellow eyes are a unique trait found only among werewolves.

-   **FullMoonAbsence:** A discrete variable representing the number of days the student was absent after a full moon. This variable is normally distributed, with werewolves tending to be absent more on such days.

-   **GPA:** A continuous variable representing the student's Grade Point Average. This variable is normally distributed, with the same distribution for werewolves and non-werewolves.

-   **WerewolfParents:** A discrete variable indicating the number of the student's parents who are werewolves. Possible values are 0, 1, or 2, with different base probabilities for werewolf and non-werewolf students.

-   **Detentions:** A discrete variable indicating the number of times the student has been in detention. This follows a Pareto distribution, implying that most students have few detentions, but a few have many. There is no difference in this variable between werewolves and non-werewolves.

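-   **IsWerewolf:** A binary variable indicating whether the student is a werewolf or not. Possible values are True (werewolf) or False (non-werewolf).

-   **Homeroom:** A categorical variable identifying the student's homeroom, such as '10-A' or '11-B'.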
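
A data dictionary is also something you can check your data against. The following is a rough sketch of that idea, not part of the original notebook: it takes four rows (a subset of columns copied from the tables above) and confirms that every value is one the dictionary allows. The names `allowed`, `results`, and `yellow_non_wolves` are hypothetical.

```python
import io
import pandas as pd

# Four rows, with a subset of columns, copied from the outputs above.
csv_text = """Sex,EyeColor,WerewolfParents,IsWerewolf
Male,Blue,0,True
Female,Grey,0,False
Female,Green,0,True
Male,Blue,1,False
"""
df = pd.read_csv(io.StringIO(csv_text))

# Allowed values, taken from the data dictionary.
allowed = {
    "Sex": ["Male", "Female"],
    "EyeColor": ["Brown", "Blue", "Green", "Grey", "Yellow"],
    "WerewolfParents": [0, 1, 2],
}

# For each column, is every value one of the allowed ones?
results = {col: bool(df[col].isin(vals).all()) for col, vals in allowed.items()}

# The dictionary says yellow eyes occur only among werewolves,
# so this selection should come back empty.
yellow_non_wolves = df[(df["EyeColor"] == "Yellow") & (~df["IsWerewolf"])]

print(results)
print(len(yellow_non_wolves))
```

Running the same checks on the full `school_df` is a quick way to catch typos or unexpected categories before starting an analysis.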

## Measures of Central Tendency

In statistics, measures of central tendency are used to identify the center of a data set, giving us a representative value that defines the middle of the data distribution. These measures are crucial in summarizing a large set of data with a single value that represents the entire group. In this section, we'll explore three primary measures of central tendency: mean, median, and mode.

The **mean** is the most commonly known measure of central tendency. It is calculated by adding all the values in a data set and then dividing by the number of values. The mean provides a useful overall measure when the data are roughly symmetric and free of extreme values (outliers).

*Example:* To calculate the mean height of students in our dataset, add all the students' heights together and then divide by the total number of students. If five students have heights in inches of 60, 62, 65, 68, and 70, the mean height is (60 + 62 + 65 + 68 + 70) / 5 = 65 inches.
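
The arithmetic in this example can be checked in a few lines of Pandas. This is a small sketch using the five example heights; the variable names are chosen for illustration.

```python
import pandas as pd

# The five example heights, in inches.
heights = pd.Series([60, 62, 65, 68, 70])

# By hand: add everything up, then divide by the number of values.
manual_mean = heights.sum() / len(heights)

# The same calculation using the built-in mean() method.
pandas_mean = heights.mean()

print(manual_mean)  # 65.0
print(pandas_mean)  # 65.0
```

On the real data, the equivalent call would be `class_df['Height'].mean()`.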

The **median** is the middle value in a data set when it's arranged in ascending or descending order. If there is an even number of observations, the median is the average of the two middle values. The median is particularly useful when dealing with data that have outliers, as it is not as affected by them as the mean.

*Example:* To find the median height, sort the heights and pick the middle one. If our heights are 60, 62, 65, 68, and 70 inches, the median is 65 inches (the third value). If there's an additional height of 66 inches, the median is the average of the two middle values: (65 + 66) / 2 = 65.5 inches.
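
Both cases from this example, an odd and an even number of values, can be reproduced with the `median()` method. In this sketch the sixth height is deliberately added at the end, out of order, to show that the data do not need to be sorted first.

```python
import pandas as pd

# Odd number of values: the median is the single middle value.
five_heights = pd.Series([60, 62, 65, 68, 70])
median_odd = five_heights.median()

# Even number of values: the median is the average of the two middle values.
# The extra height (66) is appended unsorted on purpose.
six_heights = pd.Series([60, 62, 65, 68, 70, 66])
median_even = six_heights.median()

print(median_odd)   # 65.0
print(median_even)  # 65.5
```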

The **mode** is the most frequently occurring value in a data set. A data set may have one mode, more than one mode, or no mode at all. The mode is especially useful for categorical data where we want to know which is the most common category.

*Example:* In determining the mode for eye color in our dataset, if 'Brown' occurs most frequently among the students, then 'Brown' is the mode. If 'Brown' and 'Blue' are equally common, the data set is bimodal, and both colors are modes.
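
Here is a short sketch of both situations using the `mode()` method. The two lists of eye colors are hypothetical, made up to produce one clear mode and one tie; they are not drawn from the dataset.

```python
import pandas as pd

# Hypothetical eye colors with a single most common value.
one_mode = pd.Series(["Brown", "Blue", "Brown", "Green", "Blue", "Brown"])

# Hypothetical eye colors where 'Brown' and 'Blue' are tied.
two_modes = pd.Series(["Brown", "Blue", "Brown", "Blue", "Green"])

# mode() returns a Series, because there can be more than one mode.
print(list(one_mode.mode()))   # ['Brown']
print(list(two_modes.mode()))  # ['Blue', 'Brown']

# value_counts() shows the full tally behind the answer.
print(one_mode.value_counts())
```

When every value occurs exactly once, `mode()` returns all of them, which corresponds to the "no mode" case described above.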

### Why and How to Use Each Measure

Each measure of central tendency gives a different perspective on the data:

-   Use the mean for a quick, general understanding of the dataset, especially when the data distribution is symmetrical without outliers.
-   Use the median to find the middle of the dataset, especially when the data has outliers or is not symmetrically distributed.
-   Use the mode to understand the most common category or value in your dataset, particularly with categorical data.
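
To see why the choice matters, consider the Detentions values for the five Homeroom 11-B students displayed in Step 3. One student has 8 detentions while the others have 0 or 1. The sketch below compares the mean and the median for just those five values (not the whole class).

```python
import pandas as pd

# Detentions for the five Homeroom 11-B rows shown earlier.
detentions = pd.Series([1, 8, 0, 1, 0])

mean_detentions = detentions.mean()      # pulled upward by the student with 8
median_detentions = detentions.median()  # the middle value of 0, 0, 1, 1, 8

print(mean_detentions)    # 2.0
print(median_detentions)  # 1.0
```

The mean suggests a typical student has two detentions, even though four of the five have one or none. The median gives a better picture of the typical student here, which is why it is often preferred for skewed variables like Detentions.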

Understanding these measures helps you analyze datasets like our High School Werewolf Dataset effectively. They provide a simple yet powerful way to summarize and interpret large amounts of data, offering insights that might not be immediately apparent.