# 1. Explain the different types of data (qualitative and quantitative) and provide examples of each. Discuss nominal, ordinal, interval, and ratio scales.

In [5]:
# Data can be broadly classified into two categories: qualitative (categorical) and quantitative (numerical).

# Qualitative Data: Represents characteristics or qualities and can't be measured numerically.

# Examples:
#   - Eye color (blue, brown, green)
#   - Gender (male, female, other)
#   - Blood type (A, B, AB, O)
#   - Types of fruit (apple, banana, orange)
#   - Country of origin


# Quantitative Data: Represents measurable quantities and can be expressed numerically. It is further divided into discrete (countable) and continuous (measurable) data.

# Examples:
#    - Height (170 cm, 185 cm)
#    - Weight (65 kg, 72 kg)
#    - Age (25 years, 30 years)
#    - Number of students in a class
#    - Income


# Scales of Measurement:

# 1. Nominal Scale:  Categorizes data into mutually exclusive groups with no inherent order.
#    - Example: Colors of cars (red, blue, green), Types of animals.
#    - Operations: Equality/Inequality only

# 2. Ordinal Scale:  Data is categorized and ranked in a specific order, but the differences between categories are not necessarily equal or measurable.
#    - Example: Educational levels (high school, bachelor's, master's, doctorate), Customer satisfaction ratings (very satisfied, satisfied, neutral, dissatisfied, very dissatisfied).
#    - Operations: Equality/Inequality, greater than/less than


# 3. Interval Scale:  Data is ordered, and the differences between values are meaningful and consistent. However, there's no true zero point.
#    - Example: Temperature in Celsius or Fahrenheit, Calendar years (the starting point of a calendar is arbitrary and does not mean absence of time).
#    - Operations: All the operations on ordinal scale + addition/subtraction

# 4. Ratio Scale:  Data is ordered, the differences are meaningful and consistent, and there is a true zero point. This means zero represents the absence of the attribute being measured.
#    - Example: Height, weight, age, income, number of items.
#    - Operations: All previous operations + multiplication/division
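
# A minimal sketch of why ratios are meaningful on a ratio scale but not on an interval scale.
# The two temperatures are illustrative values (an assumption, not data from this notebook),
# and Kelvin is used as the ratio-scale counterpart of Celsius.

```python
# Celsius is an interval scale: differences are meaningful, ratios are not.
temp_a_celsius = 10.0
temp_b_celsius = 20.0

difference = temp_b_celsius - temp_a_celsius   # meaningful: 10 degrees warmer
naive_ratio = temp_b_celsius / temp_a_celsius  # 2.0, but 20 C is not "twice as hot" as 10 C

# Kelvin has a true zero point, so it is a ratio scale and ratios are meaningful.
temp_a_kelvin = temp_a_celsius + 273.15
temp_b_kelvin = temp_b_celsius + 273.15
true_ratio = temp_b_kelvin / temp_a_kelvin

print(f"Difference in Celsius: {difference}")
print(f"Naive ratio in Celsius: {naive_ratio}")
print(f"Ratio in Kelvin: {true_ratio:.3f}")
```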




# 2. What are the measures of central tendency, and when should you use each? Discuss the mean, median, and mode with examples and situations where each is appropriate.

In [6]:
# Measures of Central Tendency: Mean, Median, and Mode

# Mean: The average of a dataset.  It's calculated by summing all values and dividing by the number of values.

# When to use:  Appropriate for roughly symmetric data (for example, a bell-shaped normal distribution) without significant outliers, because outliers heavily influence the mean.

# Example:  Calculating the average test score of a class.  If the scores are generally clustered around the average, the mean provides a good representation of the typical score.


# Example calculation:
def calculate_mean(data):
  if not data:  #Handle empty list
    return 0
  return sum(data) / len(data)


data = [10, 20, 30, 40, 50]
mean = calculate_mean(data)
print(f"Mean of data is {mean}")

# Median: The middle value when the dataset is ordered. If there's an even number of values, the median is the average of the two middle values.

# When to use:  More robust to outliers than the mean.  Useful when the data is skewed (not symmetric) or contains extreme values.

# Example:  Analyzing house prices in a neighborhood.  A few extremely expensive houses could skew the mean, but the median would represent a more typical price.

# Example Calculation
def calculate_median(data):
  data = sorted(data)  # sort a copy so the caller's list is not modified
  n = len(data)
  if n % 2 == 0: # even number of elements
    mid1 = data[n // 2 - 1]
    mid2 = data[n // 2]
    median = (mid1 + mid2) / 2
  else:
    median = data[n // 2]
  return median

data = [10, 20, 30, 40, 50, 1000]
median = calculate_median(data)
print(f"Median of the data is {median}")



# Mode: The value that appears most frequently in a dataset. A dataset can have one mode (unimodal), two modes (bimodal), or more.

# When to use:  Useful for categorical data or when looking for the most common value in a dataset.  Not very informative for continuous data as each value may occur only once.

# Example: Finding the most common color of car sold.


# Example Calculation
from collections import Counter

def calculate_mode(data):
  count = Counter(data)
  max_count = max(count.values())
  mode = [k for k, v in count.items() if v == max_count]
  return mode


data = [10, 20, 30, 20, 20, 40, 50]
mode = calculate_mode(data)
print(f"Mode of data is {mode}")


Mean of data is 30.0
Median of the data is 35.0
Mode of data is [20]
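
# The sketch below illustrates the "when to use" guidance above with Python's built-in statistics module.
# It shows how a single extreme value shifts the mean far more than the median, and how the mode applies
# to categorical data. The car_colors list is a hypothetical example, not data from this notebook.

```python
import statistics

scores = [10, 20, 30, 40, 50]
scores_with_outlier = scores + [1000]  # same data with one extreme value added

mean_before = statistics.mean(scores)
mean_after = statistics.mean(scores_with_outlier)
median_before = statistics.median(scores)
median_after = statistics.median(scores_with_outlier)

print(f"Mean without / with outlier: {mean_before} / {mean_after:.2f}")
print(f"Median without / with outlier: {median_before} / {median_after}")

# The mode is the natural summary for categorical (nominal) data.
car_colors = ["red", "blue", "red", "white", "red", "blue"]
most_common_color = statistics.mode(car_colors)
print(f"Most common car color: {most_common_color}")
```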


# 3. Explain the concept of dispersion. How do variance and standard deviation measure the spread of data?

In [7]:
# Dispersion (also called variability or spread) describes how spread out the data points in a dataset are.  It measures how far the data values deviate from a measure of central tendency (typically the mean or median). High dispersion indicates that the data points are widely scattered, while low dispersion suggests they are clustered closely around the central value.

# Variance and standard deviation are the two most common measures of dispersion.

# Variance:  The average of the squared differences from the mean.  It gives a sense of how much individual data points deviate from the mean value.  Squaring the differences ensures that both positive and negative deviations contribute positively to the total spread.  However, variance is expressed in squared units, which may not be directly interpretable in the context of the original data.

# Example Calculation
import math
def calculate_variance(data):
    n = len(data)
    if n == 0:
        return 0  # Handle empty list
    mean = sum(data) / n
    variance = sum((x - mean)**2 for x in data) / n  # population variance (divides by n)
    return variance

data = [10, 20, 30, 40, 50]
variance = calculate_variance(data)
print(f"Variance of data is {variance}")


# Standard Deviation: The square root of the variance. It is expressed in the same units as the original data, making it easier to interpret in the context of the problem. The standard deviation provides a measure of the typical distance between data points and the mean.  A larger standard deviation means more spread-out data.

# Example Calculation
def calculate_standard_deviation(data):
    variance = calculate_variance(data)
    std_dev = math.sqrt(variance)
    return std_dev

data = [10, 20, 30, 40, 50]
std_dev = calculate_standard_deviation(data)
print(f"Standard deviation of data is {std_dev}")

# In summary, variance and standard deviation both quantify the spread of data around the mean. Standard deviation is generally preferred as it shares the same unit as the data, facilitating easier interpretation.

Variance of data is 200.0
Standard deviation of data is 14.142135623730951
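
# The functions above divide by n, which gives the population variance. When the data is a sample drawn
# from a larger population, the sample variance (dividing by n - 1) is commonly used instead.
# A minimal sketch of the difference, using the built-in statistics module on the same data:

```python
import statistics

data = [10, 20, 30, 40, 50]

population_variance = statistics.pvariance(data)  # divides by n
sample_variance = statistics.variance(data)       # divides by n - 1
sample_std_dev = statistics.stdev(data)           # square root of the sample variance

print(f"Population variance: {population_variance}")
print(f"Sample variance: {sample_variance}")
print(f"Sample standard deviation: {sample_std_dev:.4f}")
```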


# 4. What is a box plot, and what can it tell you about the distribution of data?

In [8]:
# A box plot (or box-and-whisker plot) is a graphical representation of the distribution of numerical data. It visually displays the five-number summary of a dataset:

# 1. Minimum: The smallest value in the dataset (excluding outliers).
# 2. First Quartile (Q1 or 25th percentile): The value below which 25% of the data falls.
# 3. Median (Q2 or 50th percentile): The middle value of the dataset.
# 4. Third Quartile (Q3 or 75th percentile): The value below which 75% of the data falls.
# 5. Maximum: The largest value in the dataset (excluding outliers).

# The box in the box plot represents the interquartile range (IQR), which is the difference between the third and first quartiles (IQR = Q3 - Q1).  It contains the middle 50% of the data. A line inside the box indicates the median.  By a common convention, the "whiskers" extend from the box to the most extreme data points that lie within 1.5 times the IQR of the quartiles. Data points beyond the whiskers are considered outliers and are often plotted as individual points.

# What a box plot tells you about the distribution of data:

# 1. Central Tendency: The median line within the box gives an idea of the central value of the data.
# 2. Spread/Dispersion: The length of the box (IQR) shows the spread of the middle 50% of the data. Longer boxes indicate more variability, while shorter boxes indicate less variability. The whiskers further illustrate the overall range of the data.
# 3. Skewness: The position of the median within the box can indicate skewness. If the median is closer to Q1, the data is right-skewed (positive skew), with a longer right tail. If the median is closer to Q3, the data is left-skewed (negative skew), with a longer left tail.  A perfectly symmetrical distribution will have the median in the center of the box.
# 4. Outliers:  Outliers are clearly visible as individual points plotted beyond the whiskers.  They can indicate unusual observations or errors in the data.  They can significantly affect the mean but have less impact on the median.
# 5. Comparisons: Box plots are excellent for comparing the distribution of multiple datasets side by side. You can quickly visually compare their central tendencies, spreads, and shapes.

# Example:

# Imagine two datasets representing the test scores of two different classes.
# A box plot can reveal whether one class has higher scores on average (comparing medians),
# whether scores are more spread out in one class (comparing IQRs), or if either class has unusually high or low scores (outliers).
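
# A minimal sketch of the numbers a box plot draws, computed without plotting.
# The values list is hypothetical, and the "inclusive" quartile method of statistics.quantiles is
# one of several conventions (plotting libraries may use a slightly different one).

```python
import statistics

values = [10, 20, 30, 40, 50, 60, 70, 80, 200]  # hypothetical data with one extreme value

q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme points inside the fences
inside = [x for x in values if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)
outliers = [x for x in values if x < lower_fence or x > upper_fence]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"Whiskers: {whisker_low} to {whisker_high}")
print(f"Outliers: {outliers}")
```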

# 5. Discuss the role of random sampling in making inferences about populations.




In [9]:

# Random sampling is crucial for making accurate inferences about populations because it helps to ensure that the sample is representative of the population from which it's drawn.  If a sample is not representative, conclusions drawn from it may not accurately reflect the characteristics of the entire population.

# Here's a breakdown of its role:

# 1. Reducing Bias: Random sampling minimizes bias by giving every member of the population an equal chance of being selected. Non-random sampling methods, like convenience sampling or voluntary response sampling, can lead to samples that overrepresent certain groups and underrepresent others, introducing systematic errors into the results.

# 2. Generalizability:  A randomly selected sample allows researchers to generalize their findings to the larger population with a quantifiable degree of confidence.  For a random sample, a larger sample size gives more precise estimates, so we can be more confident in the generalizability of our results.

# 3. Statistical Inference: Random sampling is a fundamental requirement for many statistical methods used to make inferences about populations. Statistical tests rely on the assumption that the sample is a random representation of the population to calculate probabilities and draw conclusions.  For instance, confidence intervals and hypothesis testing assume random sampling.

# 4. Estimation of Population Parameters:  Random samples provide a basis for estimating population parameters (like the population mean or proportion) with a certain margin of error.  Random sampling allows us to quantify the uncertainty associated with these estimates.  We can calculate confidence intervals that express a range within which the true population parameter is likely to fall.

# 5. Evaluating Treatment Effects (in experiments): In experimental designs, random assignment of participants to treatment and control groups is crucial. This random assignment helps ensure that any observed differences between groups are due to the treatment and not pre-existing differences between the groups.

# Example: Suppose you want to estimate the average income of all households in a city.  A random sample of households from across the city would provide a more accurate estimate than simply surveying households in a wealthy neighborhood.  The random sample reduces the likelihood of overrepresenting high-income earners and underrepresenting low-income earners, leading to a more accurate estimate of the city's overall average income.

# In summary: Random sampling is a cornerstone of sound statistical practice. It allows researchers to collect data that is representative of the population of interest, reducing bias and enabling accurate inferences about that population.
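
# The household income example can be sketched as a small simulation. Everything here is an assumption
# made for illustration: the population is synthetic (normally distributed incomes), and the "wealthy
# neighborhood" is modeled as the 500 highest-income households.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population of 10,000 household incomes
population = [random.gauss(50_000, 15_000) for _ in range(10_000)]
population_mean = statistics.mean(population)

# Simple random sample: every household has the same chance of being selected
random_sample = random.sample(population, 500)
random_estimate = statistics.mean(random_sample)

# Biased sample: only the 500 highest-income households
biased_sample = sorted(population)[-500:]
biased_estimate = statistics.mean(biased_sample)

print(f"Population mean:        {population_mean:.0f}")
print(f"Random sample estimate: {random_estimate:.0f}")
print(f"Biased sample estimate: {biased_estimate:.0f}")
```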

# 6. Explain the concept of skewness and its types. How does skewness affect the interpretation of data?

In [10]:
# Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.  In simpler terms, it describes the lack of symmetry in a dataset's distribution. A perfectly symmetrical distribution (like a normal distribution) has a skewness of zero.

# Types of Skewness:

# 1. Positive Skew (Right Skew):  The tail on the right side of the distribution is longer or fatter than the left side.  The mean is typically greater than the median, which is greater than the mode.  This indicates that there are some relatively high values that pull the mean to the right.

# Example: Income distribution.  Most people have moderate incomes, but a small number of individuals have very high incomes, creating a long tail to the right.

# 2. Negative Skew (Left Skew): The tail on the left side of the distribution is longer or fatter than the right side.  The mean is typically less than the median, which is less than the mode.  This indicates that there are some relatively low values pulling the mean to the left.

# Example:  Test scores where most students perform well, but a few students score very low.


# 3. Zero Skew:  The distribution is perfectly symmetrical.  For a symmetric distribution with a single peak, the mean, median, and mode are all equal.

# Example:  A perfectly normal distribution.


# How Skewness Affects Interpretation:

# 1. Misleading Central Tendency:  In skewed distributions, the mean can be a misleading measure of central tendency because it is influenced by outliers. The median is often a better representation of the typical value in skewed data.

# 2. Impact on Statistical Tests:  Many statistical tests assume normality (zero skew).  If the data is significantly skewed, the results of these tests may be unreliable. Transformations (like taking the logarithm of the data) can sometimes be applied to reduce skewness and make the data more suitable for these tests.


# 3. Understanding Data Distribution: Skewness provides insights into the shape of the data distribution.  Understanding the shape is important for selecting appropriate statistical methods and drawing meaningful conclusions. For instance, a long tail on one side suggests the presence of outliers or extreme values.

# 4. Business Decisions: In business applications, skewness can significantly affect decision-making. For example, in sales forecasting, a right-skewed distribution of sales might indicate a higher chance of unusually high sales, while a left skew might indicate a risk of lower sales.

# Example (Illustrative -  you'd typically visualize this with a histogram):

# Imagine a dataset of house prices.  If there are a few extremely expensive mansions, the distribution will be right-skewed.  The mean house price would be higher than the median due to the influence of these expensive outliers.  In this case, the median price would be a more representative measure of the typical house price.  If we used the mean to estimate the "average" price, our estimate would be distorted upwards.
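
# A minimal sketch of the house price example. The prices are hypothetical, and sample_skewness uses
# the moment-based coefficient m3 / m2**1.5, which is one of several skewness conventions.

```python
import statistics

def sample_skewness(data):
    """Moment-based skewness: third central moment divided by variance**1.5."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

house_prices = [200, 220, 250, 260, 280, 300, 1500]  # in thousands, one mansion
symmetric_data = [1, 2, 3, 4, 5]

print(f"Mean price: {statistics.mean(house_prices)}")
print(f"Median price: {statistics.median(house_prices)}")
print(f"Skewness of house prices: {sample_skewness(house_prices):.2f}")
print(f"Skewness of symmetric data: {sample_skewness(symmetric_data):.2f}")
```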


# 7. What is the interquartile range (IQR), and how is it used to detect outliers?

In [11]:
# The interquartile range (IQR) is a measure of statistical dispersion, being equal to the difference between 75th and 25th percentiles, or between upper and lower quartiles, IQR = Q3 − Q1.

# How IQR is used to detect outliers:

# 1. Calculate the IQR: Find the difference between the third quartile (Q3) and the first quartile (Q1) of your dataset.

# 2. Determine the outlier boundaries:
#    - Lower bound: Q1 - 1.5 * IQR
#    - Upper bound: Q3 + 1.5 * IQR

# 3. Identify outliers: Any data point that falls below the lower bound or above the upper bound is considered a potential outlier.


# Example:
def calculate_iqr_and_outliers(data):
    data = sorted(data)  # sort a copy so the caller's list is not modified
    n = len(data)
    if n < 4:  # Not enough data for quartiles
        return None, []

    # Approximate quartile positions with the (n + 1) rule, without interpolation
    q1_index = (n + 1) // 4 - 1  # index of the first quartile
    q3_index = (3 * (n + 1)) // 4 - 1  # index of the third quartile

    q1 = data[q1_index]
    q3 = data[q3_index]

    iqr = q3 - q1

    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr

    outliers = [x for x in data if x < lower_bound or x > upper_bound]

    return iqr, outliers

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 1000]  # Example with an outlier
iqr, outliers = calculate_iqr_and_outliers(data)

print(f"IQR is {iqr}")
print(f"Outliers are: {outliers}")


data2 = [10,12,15,18,20,21,22,25,28] # Example without outlier
iqr, outliers = calculate_iqr_and_outliers(data2)
print(f"IQR is {iqr}")
print(f"Outliers are: {outliers}")


IQR is 60
Outliers are: [1000]
IQR is 10
Outliers are: []
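
# Quartiles can be computed with several conventions, so the IQR of a small dataset may differ slightly
# between tools. A sketch comparing the two methods offered by statistics.quantiles on the first dataset
# above (the helper name iqr_outliers is introduced here for illustration):

```python
import statistics

def iqr_outliers(data, method):
    """Return (IQR, outliers) using the 1.5 * IQR rule and the given quartile method."""
    q1, _, q3 = statistics.quantiles(data, n=4, method=method)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    return iqr, [x for x in data if x < lower_bound or x > upper_bound]

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 1000]

iqr_exclusive, outliers_exclusive = iqr_outliers(data, "exclusive")
iqr_inclusive, outliers_inclusive = iqr_outliers(data, "inclusive")

print(f"Exclusive method: IQR={iqr_exclusive}, outliers={outliers_exclusive}")
print(f"Inclusive method: IQR={iqr_inclusive}, outliers={outliers_inclusive}")
```

# The IQR differs between the two methods, but the same point is flagged as an outlier in both cases.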


# 8. Discuss the conditions under which the binomial distribution is used.

In [12]:
# The binomial distribution is a discrete probability distribution that describes the probability of obtaining exactly k successes in n independent Bernoulli trials, where each trial has the same probability of success, p.

# Conditions for using the binomial distribution:

# 1. Fixed number of trials (n):  The experiment consists of a fixed number of trials.  For example, flipping a coin 10 times or observing 100 light bulbs to see if they are defective.

# 2. Independent trials: The outcome of each trial is independent of the others. The result of one coin flip does not affect the outcome of subsequent flips.  This independence is crucial.

# 3. Two possible outcomes (success/failure): Each trial can only result in one of two outcomes: success or failure.  Examples: Heads or tails, defective or not defective, pass or fail a test.

# 4. Constant probability of success (p): The probability of success (p) is the same for each trial. When flipping a fair coin, the probability of getting heads is always 0.5 on each individual flip.

# Examples where the binomial distribution applies:

# * Number of heads in 20 coin flips (n=20, p=0.5, assuming a fair coin).
# * Number of defective light bulbs in a sample of 50 (n=50, p = probability of a defective bulb).
# * Number of successful free throws in 10 attempts by a basketball player (n=10, p = player's free throw percentage).
# * Number of correctly answered questions on a 20-question multiple-choice test (n=20, p = probability of answering a question correctly, assuming each answer is an independent guess with the same p).

# When the conditions are not met, other distributions (e.g., Poisson, hypergeometric, or negative binomial) might be more appropriate.
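
# When the four conditions hold, the probability of exactly k successes is C(n, k) * p^k * (1 - p)^(n - k).
# A minimal sketch for the coin flip example (the function name binomial_pmf is introduced here):

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly 10 heads in 20 flips of a fair coin
prob_10_heads = binomial_pmf(10, 20, 0.5)
print(f"P(exactly 10 heads in 20 flips) = {prob_10_heads:.4f}")

# The probabilities over all possible outcomes (0 to 20 heads) sum to 1
total_probability = sum(binomial_pmf(k, 20, 0.5) for k in range(21))
print(f"Sum over all outcomes = {total_probability:.4f}")
```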

# 9. Explain the properties of the normal distribution and the empirical rule (68-95-99.7 rule).

In [13]:
# Properties of the Normal Distribution:

# 1. Symmetrical: The normal distribution is perfectly symmetrical around its mean (μ).  This means that the left and right sides of the curve are mirror images of each other.

# 2. Bell-shaped: The distribution has a characteristic bell shape.  The highest point of the curve corresponds to the mean, median, and mode (which are all equal in a normal distribution).

# 3. Defined by Mean (μ) and Standard Deviation (σ): The normal distribution is completely determined by its mean (μ), which represents the center of the distribution, and its standard deviation (σ), which measures the spread or variability of the data.  A larger standard deviation results in a wider, flatter curve, while a smaller standard deviation leads to a taller, narrower curve.

# 4. Asymptotic Tails: The tails of the normal distribution extend infinitely in both directions, approaching but never touching the horizontal axis.  This means that there's a theoretical possibility, however small, of observing extremely high or low values.

# 5. Empirical Rule (68-95-99.7 Rule):  This rule describes the proportion of data that falls within a certain number of standard deviations from the mean:

#    * Approximately 68% of the data falls within one standard deviation of the mean (μ ± σ).
#    * Approximately 95% of the data falls within two standard deviations of the mean (μ ± 2σ).
#    * Approximately 99.7% (almost all) of the data falls within three standard deviations of the mean (μ ± 3σ).

# Significance of the Empirical Rule:

# * Data Interpretation: The empirical rule provides a quick way to understand the distribution of data. It helps us visualize where most of the data points lie and identify potential outliers.
# * Outlier Detection: Values that fall outside of three standard deviations from the mean are considered unusual or outliers.
# * Probability Estimation:  The empirical rule allows for a rough estimation of probabilities associated with different ranges of values within a normal distribution.
# * Real-World Applications: Many natural phenomena and measurement errors follow a normal distribution, making the empirical rule a valuable tool in various fields, including statistics, engineering, finance, and medicine.


# Example:

# Let's say the height of adult women follows a normal distribution with a mean of 5'4" and a standard deviation of 2".  Using the empirical rule:

# * About 68% of women are between 5'2" and 5'6" tall (5'4" ± 2").
# * About 95% of women are between 5'0" and 5'8" tall (5'4" ± 2 * 2").
# * About 99.7% of women are between 4'10" and 5'10" tall (5'4" ± 3 * 2").
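
# The three percentages in the empirical rule can be checked numerically. For a normal variable, the
# probability of falling within k standard deviations of the mean is erf(k / sqrt(2)). A minimal sketch
# (the function name prob_within is introduced here):

```python
import math

def prob_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"Within {k} standard deviation(s): {prob_within(k):.4f}")
```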

# 10. Provide a real-life example of a Poisson process and calculate the probability for a specific event.

In [14]:
import math

def poisson_probability(k, lam):
  """Calculates the probability of k events in a Poisson process with rate parameter lambda."""
  return (lam**k * math.exp(-lam)) / math.factorial(k)

# Example: Website Traffic
# Suppose a website receives an average of 5 visitors per minute.
# What is the probability that exactly 3 visitors arrive in a given minute?

# Define the rate parameter (lambda): average number of visitors per minute
lam = 5

# Define the desired number of events (k): exactly 3 visitors
k = 3

# Calculate the probability
probability = poisson_probability(k, lam)
print(f"The probability of exactly 3 visitors arriving in one minute is: {probability:.4f}")


# Another example: Car Accidents
# On average, 2 accidents occur per day at a particular intersection.
# What is the probability of observing exactly 4 accidents at the intersection today?

lam = 2
k = 4
probability = poisson_probability(k, lam)
print(f"The probability of observing 4 accidents today is {probability:.4f}")

The probability of exactly 3 visitors arriving in one minute is: 0.1404
The probability of observing 4 accidents today is 0.0902
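
# Probabilities for ranges of counts follow by summing individual Poisson probabilities.
# A sketch that reuses the two scenarios above (the function is repeated so the block runs on its own):

```python
import math

def poisson_probability(k, lam):
    """Probability of exactly k events when the average rate is lam."""
    return (lam**k * math.exp(-lam)) / math.factorial(k)

# Website: probability of at most 3 visitors in a minute (average of 5 per minute)
prob_at_most_3 = sum(poisson_probability(k, 5) for k in range(4))
print(f"P(at most 3 visitors) = {prob_at_most_3:.4f}")

# Intersection: probability of at least 1 accident today (average of 2 per day)
prob_at_least_1 = 1 - poisson_probability(0, 2)
print(f"P(at least 1 accident) = {prob_at_least_1:.4f}")
```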


# 11. Explain what a random variable is and differentiate between discrete and continuous random variables.

In [15]:
# A random variable is a variable whose value is a numerical outcome of a random phenomenon.  It's a function that maps the outcomes of a random experiment to numerical values.

# Discrete Random Variable:

# A discrete random variable can only take on a finite number of values or a countably infinite number of values. These values are often integers, but they don't have to be.  The key is that there are gaps between the possible values.

# Examples:

# * Number of heads when flipping a coin four times (can be 0, 1, 2, 3, or 4).
# * Number of cars passing a certain point on a highway in an hour (0, 1, 2, 3,...).
# * Number of defective items in a batch of 100 (0, 1, 2,...100).


# Continuous Random Variable:

# A continuous random variable can take on any value within a given range or interval.  There are no gaps between the possible values.  Typically, continuous random variables represent measurements.

# Examples:

# * Height of a student.
# * Weight of an object.
# * Temperature of a room.
# * Time taken to complete a task.

# Key Differences:

# 1. Possible Values:  Discrete variables have distinct, separate values, while continuous variables can take on any value within a range.

# 2. Probability:  A discrete random variable can take on a specific value with non-zero probability.  The probability of a continuous random variable taking on any single specific value is zero.  Instead, we talk about the probability that a continuous random variable falls within a certain interval.

# 3. Visualization:  The probability distribution of a discrete variable is often represented by a probability mass function (PMF), while the probability distribution of a continuous variable is represented by a probability density function (PDF).

# In summary, the difference between discrete and continuous random variables lies in the nature of the possible values and how we describe the probability of those values occurring.
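
# A minimal sketch contrasting the two cases. The discrete example is the number of heads in four fair
# coin flips. The continuous example models height as a normal distribution with a mean of 170 cm and a
# standard deviation of 10 cm; these parameters are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

# Discrete: each possible value has its own probability (PMF)
heads_pmf = {k: math.comb(4, k) / 2**4 for k in range(5)}
print(f"P(exactly 2 heads) = {heads_pmf[2]}")
print(f"Sum of PMF = {sum(heads_pmf.values())}")

# Continuous: probabilities are assigned to intervals via the CDF
height = NormalDist(mu=170, sigma=10)
prob_interval = height.cdf(180) - height.cdf(160)  # P(160 <= height <= 180)
print(f"P(160 <= height <= 180) = {prob_interval:.4f}")
```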

# 12. Provide an example dataset, calculate both covariance and correlation, and interpret the results.

In [16]:
import numpy as np

# Example dataset: Hours studied vs. Exam score
hours_studied = np.array([2, 5, 3, 8, 1, 6, 4, 7])
exam_score = np.array([60, 80, 70, 90, 50, 85, 75, 92])

# Calculate covariance
covariance = np.cov(hours_studied, exam_score, ddof=0)[0, 1]  # ddof=0 for population covariance
print(f"Covariance: {covariance}")


# Calculate correlation
correlation = np.corrcoef(hours_studied, exam_score)[0, 1]
print(f"Correlation: {correlation}")

# Interpretation:
# Covariance: A positive covariance (like the one calculated here) suggests a positive relationship between hours studied and exam score, meaning as one increases, the other tends to increase.  However, the magnitude of the covariance is difficult to interpret on its own because it depends on the scales of the variables.

# Correlation:  The correlation coefficient ranges from -1 to +1.  A correlation close to +1 (like we have here) indicates a strong positive linear relationship. This means there's a strong tendency for students who study more hours to get higher exam scores. The closer to 1, the stronger the positive linear relationship.

Covariance: 30.625
Correlation: 0.9717403276586778
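
# Correlation is covariance rescaled by the two standard deviations, which is why it is unit-free.
# A sketch that recomputes both values step by step in plain Python, using the same dataset under
# new variable names (hours and scores) so the NumPy arrays above are left untouched:

```python
import math

hours = [2, 5, 3, 8, 1, 6, 4, 7]
scores = [60, 80, 70, 90, 50, 85, 75, 92]
n = len(hours)

mean_hours = sum(hours) / n
mean_scores = sum(scores) / n

# Population covariance: average product of deviations from the means
cov = sum((h - mean_hours) * (s - mean_scores) for h, s in zip(hours, scores)) / n

# Population standard deviations
std_hours = math.sqrt(sum((h - mean_hours) ** 2 for h in hours) / n)
std_scores = math.sqrt(sum((s - mean_scores) ** 2 for s in scores) / n)

# Correlation = covariance / (product of standard deviations)
corr = cov / (std_hours * std_scores)

print(f"Covariance: {cov}")
print(f"Correlation: {corr:.4f}")
```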
