<a href="https://colab.research.google.com/github/brendanpshea/data-science/blob/main/Data_Science_04_BasicStats.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mastering Statistics With MI6

Do you fancy yourself as a secret agent, quietly maneuvering the shadows of cyberspace, battling unseen adversaries, and protecting the world from potential digital threats? If so, welcome to MI6, your codename is "R", and you report to none other than "Q", the agency's top tech guru. To excel at your new role, however, you'll need to master a seemingly mundane, yet incredibly powerful tool: Statistics.

But what is Statistics? Statistics, contrary to some beliefs, is not a torturous subject designed to give students headaches. In essence, it's a branch of mathematics dealing with the collection, analysis, interpretation, presentation, and organization of data. Whether it's forecasting sales, predicting weather, or even evaluating a basketball player's performance, statistics provide valuable insights by transforming raw data into understandable information.

But you're not here to predict rainfall or dunk hoops, you're here to safeguard digital realms. In your role, you're a data security analyst, using statistics to make sense of vast amounts of information and uncover patterns that could signify potential threats.

Now you might ask, "Why do we need statistics in data science, let alone data security analysis?" Well, data science, in its simplest form, is about extracting insights from data, and statistics is the heart that pumps life into these insights. It's the toolset that allows us to make sense of the immense and often chaotic realm of data.

Let's paint a picture to illustrate. Imagine that MI6 has intercepted gigabytes of encrypted data from a suspected criminal organization. This data could contain anything from harmless cat videos to plans of a cyber attack. It's your job to sift through this data and find what is relevant.

Now, you could manually go through each file (good luck with that!), or you could use statistical methods to analyze metadata - data about the data, like file sizes, modification dates, and file types. By calculating measures such as the average file size or the most common file type, you can quickly identify outliers or patterns. These could be clues pointing towards the data that truly matters.

This is just a taste of how you, as agent "R", can leverage statistics to fulfill your duties at MI6. As we delve deeper into this exciting field, you'll discover more tools and strategies to help you make sense of data and tackle the missions that lie ahead. Stay sharp, agent, and let's begin our statistical journey.

## Case Study Introduction: Your Role in MI6


For those who might not be familiar, let's introduce one of MI6's (the British equivalent of the CIA) most legendary agents: James Bond, also known by his codename, 007. Bond's mission, should he choose to accept it (and he always does), typically involves stopping villains with world-conquering ambitions. While Bond's exploits often involve fast cars, exotic locations, and cutting-edge gadgets, behind the scenes there's an entire team of analysts working diligently to ensure his missions are successful. One such role is that of a data analyst, a position you've just filled. Your codename? Agent "R".

As Agent "R", you're the newest member of MI6's data analysis team. You report directly to "Q", the head of technology and innovation. Your responsibilities mirror those of many entry-level jobs in the field of data science and cybersecurity. You'll be tasked with collecting, cleaning, and analyzing data to provide the agency with valuable insights that will inform their decision-making process. Your analyses could shape everything from the allocation of resources to mission strategies.

Let's get a bit more specific about your duties. You might need to:

1.  **Collect and Process Data:** Just like the police gather evidence to solve a crime, MI6 collects data to prevent threats. This data could be anything from intercepted communication records to satellite images. As part of your job, you'll use a variety of tools to collect, clean, and prepare this data for analysis.

2.  **Analyze Data:** Once the data is prepared, it's time to dig in and uncover any valuable insights. This might involve identifying patterns, making predictions, or testing hypotheses. You'll use statistical techniques to do this, which is why a good knowledge of statistics is essential.

3.  **Visualize and Report Findings:** After the analysis, you'll present your findings in a clear and digestible manner. This often involves visualizing the data in a way that makes sense to others. Remember, your audience may not be well-versed in data analysis, so your ability to communicate complex statistical concepts in a simple way is crucial.

4.  **Inform Decision-Making:** The ultimate goal of your analysis is to assist MI6 in making more informed decisions. Your insights might help determine the potential locations of a villain's lair, predict the likelihood of an attack, or identify a potential weakness in an enemy's defenses.

So why does this role require a good knowledge of statistics? Because statistics is the language we use to understand data. It provides us with the tools to convert raw, messy data into meaningful insights. It allows us to identify patterns, make predictions, and test hypotheses. In short, without statistics, we would be making decisions in the dark.

Are you ready to step into the shoes of Agent "R"? It's a role full of challenges, but also full of opportunities to learn, grow, and contribute to an important cause. Your mission, should you choose to accept it, is to harness the power of statistics to help keep the world safe. Good luck, Agent "R". The world of data science and cybersecurity awaits you.

## Your First Mission: Understanding Data with Descriptive Statistics


Welcome to your first mission, Agent "R". Your task? Dive into the world of Descriptive Statistics. But don't worry, it's not as daunting as it sounds, and by the end of this mission, you'll have a robust set of tools to understand and analyze your data.



**Descriptive Statistics**, as the name suggests, describe and summarize data. In other words, they provide a concise overview of your dataset without getting too bogged down in complex calculations or analyses. Think of them as the 'first glance' at your data, allowing you to quickly understand what you're dealing with.

### Measures of Central Tendency: Mean, Median, Mode


Measures of central tendency help you identify the 'center' or the 'average' of your data. They're like the bullseye on a dartboard, showing you where most of your data is clustering. The three most common measures are the mean, median, and mode.

#### Mean

The mean, often called the average, is calculated by adding up all the values in a dataset and then dividing by the number of values. It's a powerful tool for getting a general sense of a dataset's central value.

Let's imagine you're analyzing the sizes of a batch of encrypted files intercepted from a suspected criminal organization. If the file sizes are 2, 4, 4, 6, and 8 MB, the mean would be (2+4+4+6+8)/5 = 4.8 MB. This tells you that, on average, each file is about 4.8 MB.
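
The same arithmetic can be checked in a couple of lines of Python. This is only a small sketch; the variable names (`file_sizes_mb`, `manual_mean`, `library_mean`) are made up for illustration.

```python
from statistics import mean

# File sizes in MB, taken from the example above
file_sizes_mb = [2, 4, 4, 6, 8]

# Mean "by hand": add up all the values, then divide by how many there are
manual_mean = sum(file_sizes_mb) / len(file_sizes_mb)

# Python's standard library gives the same answer
library_mean = mean(file_sizes_mb)

print("Manual mean: ", manual_mean)
print("Library mean:", library_mean)
```

Both lines should report 4.8 MB, matching the calculation above.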

For example, you might use the mean in the following scenarios:

1.  Average File Size: Let's say you're tasked with assessing the amount of storage needed for data that MI6 has intercepted. By calculating the mean file size, you can estimate the total storage required.

2.  Average Connection Duration: If you're monitoring network traffic, understanding the average duration of connections could help identify anomalies. A connection that lasts significantly longer than the average might be a sign of a data breach.

3.  Average Password Length: When assessing the security strength of user accounts, you could calculate the average password length. A short average could indicate weak security and lead to recommendations for stronger password policies.

#### Median

The median is the middle value of a dataset when it's arranged in ascending order. If there's an even number of values, the median is the mean of the two middle values.

Let's go back to our file sizes. If they're now 2, 4, 6, 6, and 8 MB, the median is 6 MB since it's the middle value. The median is particularly useful when you have outliers in your data (extreme values that are much higher or lower than the rest) as it's not affected by them.
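
Here is a short sketch of both cases (odd and even counts), plus a quick look at how an extreme value affects the mean and the median. The 500 MB file is a hypothetical outlier invented for this illustration.

```python
from statistics import mean, median

# Odd number of values: the median is the single middle value
file_sizes_mb = [2, 4, 6, 6, 8]
median_odd = median(file_sizes_mb)

# Even number of values: the median is the mean of the two middle values
even_sizes_mb = [2, 4, 6, 8]
median_even = median(even_sizes_mb)

# Hypothetical outlier: swap the 8 MB file for a 500 MB one
with_outlier_mb = [2, 4, 6, 6, 500]
mean_with_outlier = mean(with_outlier_mb)
median_with_outlier = median(with_outlier_mb)

print("Median (odd count): ", median_odd)
print("Median (even count):", median_even)
print("Mean with outlier:  ", mean_with_outlier)
print("Median with outlier:", median_with_outlier)
```

The single huge file drags the mean above 100 MB, while the median stays at 6 MB.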

At MI6, you might use the median in the following situations:

1.  Median Income: Suppose you're analyzing the income levels of individuals in a certain region to understand socio-economic factors that may influence crime. If a few individuals have extremely high income (outliers), using the mean could give a misleading picture. The median would give a more accurate representation of a "typical" individual's income.

2.  Median Number of Daily Cyber Attacks: Cybersecurity threats can vary greatly day by day. If one day experiences an unusually high number of attacks, it can skew the mean. The median would give a more accurate picture of a "normal" day.

3.  Median Login Time: If you're analyzing user login times to a secured system, the median can provide a sense of the "typical" login time, which can be useful in identifying logins that occur at unusual times (potential intruders).

#### Mode

The mode is the most frequently occurring value in a dataset. If our file sizes are 2, 2, 4, 6, and 6 MB, then we have two modes: 2 and 6. If no value repeats, the dataset doesn't have a mode. The mode can be useful for identifying the most common occurrences in your data.

Mode is often used with **categorical** data (as opposed to numerical data).

1.  Most Commonly Used Browser: If you're evaluating the types of web browsers used to access a secured network, the mode can tell you which browser is used most often, helping you focus your security measures. Categories: Chrome, Edge, Firefox, etc.

2.  Most Common Source of Traffic: When monitoring network traffic, identifying the most common source (mode) can be useful in establishing a baseline of normal activity. Here, the categories might be IP addresses, countries of origin, etc.

3.  Most Frequent Attack Type: In cybersecurity, different types of attacks (phishing, malware, brute force, etc.) require different defensive strategies. Identifying the mode can help you prioritize resources.
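
As a rough sketch, the mode can be found for both numerical and categorical data using only Python's standard library. The browser log below is hypothetical data invented for this example.

```python
from collections import Counter
from statistics import multimode

# Numerical example from above: two values tie for most frequent
file_sizes_mb = [2, 2, 4, 6, 6]
modes = multimode(file_sizes_mb)  # returns every value tied for first place
print("Modes of file sizes:", modes)

# Categorical example: a hypothetical log of browsers seen on the network
browsers = ["Chrome", "Firefox", "Chrome", "Edge", "Chrome", "Firefox"]
browser_counts = Counter(browsers)
most_common_browser, times_seen = browser_counts.most_common(1)[0]
print("Most common browser:", most_common_browser, "seen", times_seen, "times")
```

One thing to watch for: when no value repeats, `multimode` returns every value in the list, so check the counts if you need to detect the "no mode" case.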


All these measures of central tendency give you different views of the 'center' of your data. As Agent "R", it's your job to understand what each measure is telling you and how it can help you in your analysis. Remember, the world of data security is complex, and having a good handle on your data can be the difference between mission success and failure. Your journey has just begun, Agent "R". Now, go out there and use those statistics!

## What is that Intern Up to?
One day "Q" comes in and asks you to perform a basic audit of how much "top secret data" has been downloaded by different people over the last week. (In cybersecurity, large unexpected data transfers can be a sign of trouble.) You look, and find the following data:

In [None]:
# Let's create some fake data!
import pandas as pd

roles = ['Field Agent', 'Data Analyst', 'Intern', 'Field Agent', 'Data Analyst', 'Intern', 'Field Agent', 'Data Analyst', 'Intern', 'Intern']
data_downloaded = [3.5, 5.0, 1.5, 3.7, 5.1, 1.7, 4.0, 5.4, 1.8, 50.0]  # Notice the outlier
# Create a DataFrame with one row per person
df = pd.DataFrame(list(zip(roles, data_downloaded)), columns=['Role', 'Data_Downloaded_GB'])
df  # display the data


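|   | Role | Data_Downloaded_GB |
|---|------|--------------------|
| 0 | Field Agent | 3.5 |
| 1 | Data Analyst | 5.0 |
| 2 | Intern | 1.5 |
| 3 | Field Agent | 3.7 |
| 4 | Data Analyst | 5.1 |
| 5 | Intern | 1.7 |
| 6 | Field Agent | 4.0 |
| 7 | Data Analyst | 5.4 |
| 8 | Intern | 1.8 |
| 9 | Intern | 50.0 |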


Now that we have our data, let's compute the mean, median, and mode.

In [None]:
# Calculate the mean of the download column
mean_data = df['Data_Downloaded_GB'].mean()
print("Mean: ", round(mean_data, 2))

# Calculate the median of the download column
median_data = df['Data_Downloaded_GB'].median()
print("Median: ", round(median_data, 2))

# Calculate the mode of the (categorical) Role column; mode() returns a Series
mode_data = df['Role'].mode()
print("Mode: ", mode_data)

Mean:  8.17
Median:  3.85
Mode:  0    Intern
Name: Role, dtype: object


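The pandas `mean()` method computes the mean of a column, `median()` computes the median, and `mode()` computes the mode. Because a dataset can have more than one mode, `mode()` returns a Series rather than a single value, which is why the output shows an index (`0`) next to `Intern`.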

Interpreting the Results:

-   **Mean:** The mean represents the average amount of data downloaded by all roles. Given our dataset, the mean will be quite high due to the unusually high download value of one intern (the outlier). This demonstrates one characteristic of the mean: it's sensitive to extreme values.

-   **Median:** The median represents the middle point of data downloads when sorted in ascending order. Because it only considers the middle value(s), it is not influenced by outliers, making it a more robust measure for central tendency in this scenario.

-   **Mode:** The mode represents the most commonly occurring value in the dataset. In our scenario, since no download value repeats, there is no single mode for data download. However, there IS one for role (with "Intern" being the most common). This is important to note because not all datasets will have a mode.
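
To see how much that one value matters, here is a small sketch that recomputes the mean and median with and without the 50 GB download. It uses the same numbers as the audit data above; the names `without_outlier` and `summary` are invented for this example.

```python
from statistics import mean, median

# Same download figures (in GB) as the audit data above
data_downloaded = [3.5, 5.0, 1.5, 3.7, 5.1, 1.7, 4.0, 5.4, 1.8, 50.0]

# Drop the single 50 GB value to see what a "normal" week looks like
without_outlier = [gb for gb in data_downloaded if gb != 50.0]

summary = {
    "mean_with": round(mean(data_downloaded), 2),
    "median_with": round(median(data_downloaded), 2),
    "mean_without": round(mean(without_outlier), 2),
    "median_without": round(median(without_outlier), 2),
}

for name, value in summary.items():
    print(name, "=", value)
```

Removing one value cuts the mean by more than half (from 8.17 to about 3.52), while the median barely moves (from 3.85 to 3.7).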


The median and mean *differ* so much here because of a single *outlier*. But what exactly is an outlier?

## What is an Outlier?

Outliers are values in a dataset that are significantly different from other observations. They are usually either much higher or much lower than the other values in the set. In our case, one intern's high data download value is an outlier because it's considerably larger than the rest, suggesting it's not in line with the usual behavior.
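
"Significantly different" needs a working definition before a computer can apply it. One common rule of thumb (an assumption for this sketch, not the only option) is the interquartile range (IQR) rule: take the values that mark the 25th and 75th percentiles, measure the distance between them, and flag anything more than 1.5 times that distance beyond either one.

```python
import numpy as np

# Same download figures (in GB) as the audit data above
data_downloaded = np.array([3.5, 5.0, 1.5, 3.7, 5.1, 1.7, 4.0, 5.4, 1.8, 50.0])

# 25th and 75th percentiles (first and third quartiles)
q1, q3 = np.percentile(data_downloaded, [25, 75])
iqr = q3 - q1  # spread of the middle half of the data

# Rule-of-thumb "fences": anything outside them gets flagged
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

is_outlier = (data_downloaded < lower_fence) | (data_downloaded > upper_fence)
outliers = data_downloaded[is_outlier]

print("Fences:", round(lower_fence, 2), "to", round(upper_fence, 2))
print("Flagged values:", outliers)
```

With this data only the 50 GB download is flagged. The 1.5 multiplier is a convention rather than a law, so treat a flag as a prompt to investigate, not as proof of wrongdoing.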

Outliers can arise for a variety of reasons, ranging from the benign to the serious.

Here are some standard categories of outliers, illustrated with spy-themed examples:

1.  **Data Entry Errors (The "Goldfinger" Typo)**: Perhaps our intern, having stayed up late watching James Bond movies, made a typo and accidentally downloaded an entire folder's worth of data when they wanted only a single file. Instead of downloading 1 GB, they downloaded 100 GB. These types of errors are common and can easily produce outliers.

2.  **Measurement Errors (The "Quantum of Solace" Glitch)**: Maybe our intern was testing a new piece of tech from 'Q', and it went haywire, leading to an unexpected and massive data download (or it interfered with the logging system that records data downloads). This represents a technical error that leads to outliers in the data.

3.  **Natural Outliers (The "Moonraker" Project)**: The intern, entrusted with preparing a report on the top-secret "Moonraker" project, had to download a large dataset, resulting in a one-time spike in their data usage.

4.  **Novelty Outliers (The "Thunderball" Scenario):** Our intern might be secretly training for a role as a field agent. They could be using their downtime to download and study mission briefs from past missions, resulting in unusually high data usage. In reality, novelty outliers are often a result of a new behavior or trend.

5.  **Anomalous Outliers (The "Spectre" Breach)**: The intern's unusually high data download might be a sign of nefarious activities. Could they be a double agent working for Spectre, downloading MI6's secrets for their own purposes? Anomalous outliers often indicate a security breach or fraudulent activity.

6.  **Contextual Outliers (The "Skyfall" Phenomenon)**: Maybe the intern isn't to blame at all. Instead, there could be a rogue AI or a virus within the MI6 system, using the intern's account to download and analyze data. This could represent a contextual outlier, which is an observation that deviates significantly under specific conditions.

As Agent "R", your mission is to figure out why our intern's data usage is an outlier. Is it a simple error, a one-off occurrence, a new trend, or a sign of a more serious issue? Your knowledge of statistics is your best ally in this mission!

## Questions
1.   What is the difference between the mean, median, and mode? Explain how each would be affected in a dataset where Agent "007" downloaded an unusually large amount of data.

2.  Given the following data series representing the number of missions completed by agents: [7, 5, 5, 3, 8, 9, 7, 5], calculate the mean, median, and mode.

3.   In the context of cybersecurity and data analysis, why are outliers important to identify?

4.  Explain how the presence of an outlier, such as an unusually high amount of data downloaded by an intern, might impact the measures of central tendency.

5. Agent "006" has completed 5, 6, 6, 7, and 8 missions in the last 5 months respectively. If they complete 9 missions next month, how would that affect the mean and median?

6.   If you discovered an intern at MI6 was downloading an unusually large amount of data (an outlier), what might be your next steps to investigate this?

7.  Use Python to calculate the median of this data set: [1.7, 2.5, 0.8, 1.6, 2.2]. What does the median tell you in this context?

8.  How could the concept of 'natural outliers' apply to the number of missions an agent completes in a month?

9.  If an intern's data usage was found to be a 'contextual outlier', what might be some potential causes, and how would you investigate further?

10. The following data represents GBs of data downloaded by agents in the past week: [3, 2.5, 2.7, 2.9, 3.1, 2.6, 2.8]. An intern reports downloading 4.2GB of data. Is the intern's data usage an outlier in this case? Explain your answer.

## Answers: Measures of Central Tendency
1.

2.

3.

4.

5.

6.

7.

8.

9.

10.

## Measures of Spread: Diving Deeper into Data Variability

The world of MI6 is unpredictable, and this unpredictability is mirrored in the data we analyze. In any data set, while knowing the average or central value is beneficial, it's equally crucial to understand how much individual data points vary or deviate from this average. This understanding aids in making accurate predictions, identifying anomalies, and ensuring consistency. Measures of spread provide insight into this variability.

###  Range
The **range** of a dataset is a fundamental measure that provides a quick snapshot of its spread by considering only the two extreme values: the smallest and the largest. It's a straightforward yet powerful metric. However, one of its main limitations is that it's heavily influenced by outliers. In MI6 operations, for instance, an unexpected event (like an agent's injury during training) can skew the range, giving a misleading picture of the overall spread.

$$\text{Range} = \text{max} - \text{min}$$

**Example:** Consider the durations (in hours) it took various agents to complete a special training exercise: 2, 3, 4, 7, 3, 2.5, 10.
The range would be $10 - 2 = 8$ hours.

In Python, we could calculate this as follows:

In [2]:
data = [2, 3, 4, 7, 3, 2.5, 10]
range_of_data = max(data) - min(data)
print("Range: ", range_of_data)

Range:  8


### Variance
**Variance**, represented as $s^2$, is a crucial statistical measure that quantifies the spread of data. While the range gives us an idea of how spread out the data is, it only considers the maximum and minimum values. Variance takes this a step further and investigates how far each data point deviates from the mean value. This information can be incredibly valuable in many scenarios, including the operations at MI6.

If we are looking at the performance of MI6 agents, variance can provide insights into the consistency of agent performance. For example, if we consider the accuracy of agents at a shooting range, a high variance in scores would indicate a wide disparity in agent skill levels. In contrast, a low variance would suggest that all agents are more or less performing at a similar level.

Understanding variance can help Q or Agent "R" identify potential issues and take corrective action. For instance, if there is high variance in agent performance, it might indicate that some agents need additional training. On the other hand, low variance might suggest that training methods are effective and producing consistent outcomes.

**Definition:** Variance is defined as the average of the squared differences from the mean. For a dataset $X$ with $N$ values, variance ($s^2$) is given by:

$$
\text{Variance} (s^2) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2
$$

The idea here is to first calculate the mean of the data. Then, for each data point, calculate the square of its deviation from the mean. Finally, find the average of these squared deviations. The squaring is necessary to remove any negative signs (as we don't care whether the data points are above or below the mean) and to give more weight to extreme deviations. Note that this formula divides by $N$ (often called the population variance); a common alternative, the sample variance, divides by $N - 1$ instead.
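
The three steps can be written out by hand in plain Python. This is a sketch using a small set of made-up scores (`scores` is hypothetical data); it divides by $N$, as in the formula above, and cross-checks the result against the standard library's `pvariance`.

```python
from statistics import pvariance

# Hypothetical scores, invented for this walkthrough
scores = [2, 4, 4, 6, 8]

# Step 1: calculate the mean
mean_score = sum(scores) / len(scores)

# Step 2: square each value's deviation from the mean
squared_deviations = [(x - mean_score) ** 2 for x in scores]

# Step 3: average the squared deviations (divide by N)
variance_by_hand = sum(squared_deviations) / len(scores)

print("Mean:", mean_score)
print("Variance by hand:", round(variance_by_hand, 2))
print("Variance via statistics.pvariance:", round(pvariance(scores), 2))
```

Both approaches should agree (about 4.16 for these scores), which is a handy sanity check when you are learning the formula.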

**Example:** Let's consider the accuracy percentages of agents at a shooting range: \[90\%, 85\%, 92\%, 87\%, 88\%\].  In Python, we can calculate the variance using the **NumPy** library and the `numpy.var()` function, which by default divides by $N$, just like the formula above.

In [2]:
import numpy as np

accuracy = [90, 85, 92, 87, 88]
variance = np.var(accuracy)
mean = np.mean(accuracy)
print("Variance:", round(variance, 2))
print("Mean: ", round(mean, 2))


Variance: 5.84
Mean:  88.4


Now, let's see what happens if we make the data a bit more spread out. For example, let's suppose our data is [100%, 75%, 92%, 87%, 88%].

In [8]:
accuracy = [100, 75, 92, 87, 88]
variance = np.var(accuracy)
mean = np.mean(accuracy)
print("Variance:", round(variance, 2))
print("Mean: ", round(mean, 2))

Variance: 65.84
Mean:  88.4


As you can see, the mean stays the same, but the variance goes up significantly.

### Standard Deviation

**Standard deviation**, denoted by $s$, is a powerful statistical measure that builds upon the concept of variance. While variance gives us a sense of how much individual data points deviate from the mean, it does so in squared units. For instance, if our original data were in kilometers or hours, variance would be in square kilometers or square hours, which are challenging to intuitively interpret in most practical applications.

Enter the standard deviation, which is simply the square root of the variance. It maintains all the desirable properties of variance but presents the spread of data in the original units, making it far more intuitive and straightforward to understand.

In the high-stakes world of MI6, knowing the standard deviation of mission outcomes or agent performance can help Q or our aspiring agent "R" understand the expected range of variability. For instance, if we consider the time it takes an agent to complete a mission, a higher standard deviation would mean that the completion times vary widely, which could reflect the mission's unpredictable nature. On the other hand, a lower standard deviation indicates that the mission completion times are relatively consistent, possibly due to well-defined procedures and predictable challenges.

Understanding standard deviation helps Q and agent "R" better anticipate and plan for different scenarios. They can more accurately predict best-case and worst-case scenarios, understand the level of risk involved in each mission, and even identify areas where agent training could be improved to reduce variability in mission outcomes.

**Definition:**  The standard deviation is simply the square root of the variance.
$$
\text{Standard Deviation } (s) = \sqrt{s^2}
$$

**Example:** Continuing with the shooting accuracy data from the variance example, take the square root of the variance to determine the standard deviation. We'll use `numpy.std()` to do this:

In [3]:
accuracy = [100, 75, 92, 87, 88]
variance = np.var(accuracy)
std = np.std(accuracy)
print("Variance:", round(variance, 2))
print("Standard deviation: ", round(std, 2))

Variance: 65.84
Standard deviation:  8.11
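
One practical detail is worth a quick sketch: `numpy.std()` divides by $N$ by default, while some other tools (for example, pandas' `Series.std()`) divide by $N - 1$ by default. The `ddof` parameter of `numpy.std()` switches between the two. The variable names below are made up for illustration.

```python
import numpy as np

accuracy = [100, 75, 92, 87, 88]

# Default (ddof=0): divide by N, matching the variance formula used here
population_std = np.std(accuracy)

# ddof=1: divide by N - 1, the "sample" standard deviation
sample_std = np.std(accuracy, ddof=1)

# Either way, the standard deviation is the square root of the variance
matches_sqrt_of_variance = np.isclose(population_std, np.sqrt(np.var(accuracy)))

print("Divide by N:    ", round(population_std, 2))
print("Divide by N - 1:", round(sample_std, 2))
print("Equals sqrt(variance)?", matches_sqrt_of_variance)
```

If two tools report slightly different standard deviations for the same data, this default is the first thing to check.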


### The Art of Distribution in MI6 Missions

In statistics, a **distribution** is a function that shows the possible values for a variable and how often they occur. Think of it as a big-picture view of your data, giving you a sense of where values are concentrated and how they spread out. In the world of data science and MI6 operations, understanding distributions is like having a map that helps you navigate your data and make sense of its patterns.

When Q equips Agent "R" with the latest gadgets, he might be interested in the distribution of their battery life. If the battery life follows a certain distribution, Q can predict how likely it is that the gadget will run out of power during a mission. If Agent "R" monitors enemy communication, understanding the distribution of message frequencies could reveal when they're most active or if there's a pattern to their activities.

More technically, a distribution can be defined as a function that describes the likelihood of a random variable taking on certain values. The distribution can be visualized as a graph, with the possible values of the random variable on the x-axis and the likelihood of these values on the y-axis.

#### Types of Distributions

There are various types of distributions, each with its unique properties and applications. Here are a few common ones:

- **Uniform Distribution:** In a uniform distribution, all values have the same frequency; they are equally likely to occur. If Agent "R" were to toss a fair coin, the result (heads or tails) would follow a uniform distribution.

- **Normal Distribution:**  A normal distribution, also known as a Gaussian distribution or bell curve, is a distribution where data is symmetrically distributed around the mean. The highest point in the distribution is at the mean (which is also the median and the mode). Many natural phenomena, such as heights and IQ scores, approximately follow a normal distribution. In the context of MI6, if we consider the driving speeds of all agents during high-speed chase training, they might follow a normal distribution, with most agents driving around an average speed and fewer agents driving much slower or faster.

- **Binomial Distribution:**  A binomial distribution describes the number of successes in a fixed number of independent trials, each with the same probability of success. Suppose Agent "R" makes a set number of attempts to disable a security system, and each attempt is independent and has a fixed probability of success. The number of successful attempts would follow a binomial distribution.

- **Poisson Distribution:** A Poisson distribution models the number of times an event happens in a fixed interval of time or space, assuming events occur independently and at a roughly constant average rate. If MI6 tracks the number of cyber attacks it experiences each month, this count could follow a Poisson distribution.
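
To get a feel for these four distributions, it can help to draw random samples from each and look at their averages. The sketch below uses NumPy's random generator; every parameter (average speed of 120, 10 attempts with a 30% success rate, 4 attacks per month) is a hypothetical value chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=7)  # fixed seed so the run is repeatable
n_samples = 10_000

# Uniform: a fair coin, coded as 0 (tails) or 1 (heads)
coin_flips = rng.integers(0, 2, size=n_samples)

# Normal: chase speeds with a hypothetical mean of 120 and std dev of 15
chase_speeds = rng.normal(loc=120, scale=15, size=n_samples)

# Binomial: successes in 10 attempts, each with a 30% chance of success
successes = rng.binomial(n=10, p=0.3, size=n_samples)

# Poisson: monthly attack counts with a hypothetical average of 4
attacks = rng.poisson(lam=4, size=n_samples)

print("Share of heads:     ", round(coin_flips.mean(), 3))
print("Average chase speed:", round(chase_speeds.mean(), 1))
print("Average successes:  ", round(successes.mean(), 2))
print("Average attacks:    ", round(attacks.mean(), 2))
```

The sample averages should land close to 0.5, 120, 3 (that is, 10 × 0.3), and 4 respectively, though the exact figures depend on the seed.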

Understanding distributions allows us to make inferences and predictions about our data. If we know that data follows a particular distribution, we can calculate the probability of certain outcomes, which can inform decision-making.

For instance, if Q knows that the battery life of a gadget follows a normal distribution, he can calculate the probability that the battery life exceeds a certain value, helping agents plan their missions more effectively. Similarly, if the frequency of enemy messages follows a Poisson distribution, Agent "R" can calculate the probability of receiving a certain number of messages in a day, which could potentially warn of an imminent attack.
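
Both calculations can be sketched with Python's standard library. All the numbers here are assumptions made up for the example: a battery life that is normally distributed with a mean of 10 hours and a standard deviation of 2 hours, and enemy messages arriving at an average of 5 per day. The helper `poisson_pmf` is a hypothetical function written for this sketch, not a library call.

```python
from math import exp, factorial
from statistics import NormalDist

# Assumed battery life: normal with mean 10 hours, std dev 2 hours
battery_life = NormalDist(mu=10, sigma=2)

# Probability the battery lasts longer than 12 hours
p_longer_than_12 = 1 - battery_life.cdf(12)

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average count is lam."""
    return exp(-lam) * lam ** k / factorial(k)

# Assumed message rate: 5 per day on average
messages_per_day = 5

# Probability of exactly 8 messages in a day
p_exactly_8 = poisson_pmf(8, messages_per_day)

# Probability of 8 or more: one minus the chance of 0 through 7
p_at_least_8 = 1 - sum(poisson_pmf(k, messages_per_day) for k in range(8))

print("P(battery > 12 h):  ", round(p_longer_than_12, 4))
print("P(exactly 8 msgs):  ", round(p_exactly_8, 4))
print("P(8 or more msgs):  ", round(p_at_least_8, 4))
```

Under these assumptions, a battery outlasting 12 hours has roughly a 16% chance, and a day with 8 or more messages has roughly a 13% chance. Results like these are only as good as the assumed distribution and its parameters.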

In the next sections, we will discuss some of these distributions and their significance in data science and MI6 missions.