# Statistics Basics

**Q1) Explain the different types of data (qualitative and quantitative) and provide examples of each. Discuss nominal, ordinal, interval, and ratio scales.**

-> **Types of Data: Qualitative and Quantitative**
1. Qualitative Data: Also known as categorical data, it represents characteristics or attributes that cannot be measured numerically.

Examples:
Nominal: Types of fruits (apple, banana, cherry).

Ordinal: Education level (high school, bachelor's, master's).


2. Quantitative Data: Also known as numerical data, it represents quantities that can be measured or counted.

Examples:
Interval: Temperature in Celsius or Fahrenheit (e.g., 25°C, 30°C), IQ scores.

Ratio: Weight (e.g., 50 kg, 70 kg).

**Scales of Measurement:**
* Nominal: Categories without a specific order (e.g., gender, blood type).
* Ordinal: Ordered categories with no consistent difference between levels (e.g., survey rankings).
* Interval: Ordered data with consistent intervals but no true zero (e.g., calendar years).
* Ratio: Ordered data with consistent intervals and a true zero (e.g., distance, income).
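
One way to see the difference between interval and ratio scales is to check whether a ratio survives a change of units. The sketch below is only illustrative: the helper `celsius_to_kelvin` and the sample values are assumptions chosen for the demonstration.

```python
def celsius_to_kelvin(celsius):
    """Convert Celsius (interval scale) to Kelvin (ratio scale with a true zero)."""
    return celsius + 273.15


# Ratio scale: weight has a true zero, so ratios are meaningful.
weight_ratio = 70 / 35  # 70 kg is twice as heavy as 35 kg

# Interval scale: the Celsius zero is arbitrary, so 30/15 does not mean "twice as hot".
naive_ratio = 30 / 15
true_ratio = celsius_to_kelvin(30) / celsius_to_kelvin(15)

# Differences remain meaningful on an interval scale.
difference = 30 - 15

print(f"Weight ratio: {weight_ratio}")
print(f"Celsius ratio: {naive_ratio}, same temperatures in Kelvin: {true_ratio:.3f}")
print(f"Temperature difference: {difference} degrees")
```

The Celsius ratio changes once the temperatures are expressed in Kelvin, while the weight ratio would stay the same in any unit of weight.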

**Q2) What are the measures of central tendency, and when should you use each? Discuss the mean, median, and mode with examples and situations where each is appropriate.**

-> **Measures of Central Tendency**
Measures of central tendency describe the center point of a data set, helping summarize it with a single value. The three most common measures are mean, median, and mode. Each is appropriate in different scenarios based on the nature of the data and its distribution.

**1. Mean (Arithmetic Average)**

The mean is the sum of all data points divided by the total number of points. It suits roughly symmetric data without extreme outliers, since it uses every value in the data set.
  
Formula: $\text{Mean} = \frac{\text{Sum of all data points}}{\text{Total number of data points}}$

**Example Scenario:**

Calculating the average test score in a class.

For the data set $5, 10, 15, 20, 25$:

$\text{Mean} = \frac{5 + 10 + 15 + 20 + 25}{5} = 15$
  
**2. Median (Middle Value)**

The median is the middle value of a data set when arranged in ascending or descending order. If the number of data points is even, it is the average of the two middle values. It is used for skewed data to reduce the impact of outliers.
  
**Example Scenario:**
Finding the middle salary in a company where a few executives earn significantly more than other employees.
  * For $3, 8, 9, 15, 25$: The median is $9$.
  * For $4, 10, 11, 15$: The median is $\frac{10 + 11}{2} = 10.5$.

**3. Mode (Most Frequent Value)**
The mode is the value(s) that appear most frequently in the data set. A data set can be:

  - **Unimodal:** One mode.
  - **Bimodal:** Two modes.
  - **Multimodal:** More than two modes.

It is used for categorical data or when identifying the most frequent value is important.
  
**Example Scenario:**
Determining the most popular car color sold in a dealership.
  * For $2, 3, 3, 7, 8, 8, 8, 10$: The mode is $8$.
  * For $4, 5, 5, 7, 7, 8$: The modes are $5$ and $7$ (bimodal).
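
The worked examples above can be checked with Python's standard `statistics` module. This is a minimal sketch; the `salaries` list is a made-up data set (values in thousands) added only to show how one extreme value pulls the mean but not the median.

```python
from statistics import mean, median, multimode

# Data sets taken from the examples above.
mean_value = mean([5, 10, 15, 20, 25])
odd_median = median([3, 8, 9, 15, 25])
even_median = median([4, 10, 11, 15])       # average of the two middle values
single_mode = multimode([2, 3, 3, 7, 8, 8, 8, 10])
two_modes = multimode([4, 5, 5, 7, 7, 8])   # returns every most-frequent value

# Hypothetical salaries (in thousands) with one very large value.
salaries = [30, 32, 35, 38, 40, 400]
salary_mean = mean(salaries)
salary_median = median(salaries)

print(f"Mean: {mean_value}")
print(f"Medians: {odd_median}, {even_median}")
print(f"Modes: {single_mode}, {two_modes}")
print(f"Salary mean: {salary_mean:.2f}, salary median: {salary_median}")
```

In the salary example the single large value moves the mean far above the median, which is why the median is usually preferred for skewed data.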

**Q3)  Explain the concept of dispersion. How do variance and standard deviation measure the spread of data?**

-> **Concept of Dispersion**
Dispersion refers to the spread of data points around a central value (e.g., the mean). It indicates variability in the dataset.  

**Variance**
* Measures the **average squared deviation** from the mean.  
* **Formula:** $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
* **Example:** If data points are far from the mean, variance is large, indicating high variability.

**Standard Deviation**
- The **square root of variance**, representing the typical deviation from the mean, expressed in the original units of the data.  
- **Formula:** $\sigma = \sqrt{\sigma^2}$
- **Example:** A small standard deviation means data points are tightly clustered around the mean; a large one means greater spread.
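
As a rough illustration, the population formulas above can be computed by hand and compared with the standard `statistics` module. The data set is reused from the mean example; the sample variance (dividing by $N - 1$) is included only for contrast and is not part of the formula above.

```python
from statistics import pstdev, pvariance, variance

data = [5, 10, 15, 20, 25]

# Population variance computed directly from the formula: mean of squared deviations.
mu = sum(data) / len(data)
manual_var = sum((x - mu) ** 2 for x in data) / len(data)

# The same quantities from the standard library.
pop_var = pvariance(data)     # divides by N
pop_sd = pstdev(data)         # square root of the population variance
sample_var = variance(data)   # divides by N - 1 instead of N

print(f"Manual population variance: {manual_var}")
print(f"Population variance: {pop_var}, standard deviation: {pop_sd:.4f}")
print(f"Sample variance: {sample_var}")
```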

**Q4) What is a box plot, and what can it tell you about the distribution of data?**

-> A box plot (or box-and-whisker plot) is a graphical representation of data that shows its distribution through five key summary statistics:

* **Minimum**: Smallest data point (excluding outliers).  
* **Lower quartile (Q1)**: 25th percentile.  
* **Median (Q2)**: Middle value (50th percentile).  
* **Upper quartile (Q3)**: 75th percentile.  
* **Maximum**: Largest data point (excluding outliers).

It highlights the spread, central tendency, and variability of the data, while also revealing outliers and symmetry or skewness in the distribution.
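
The numbers behind a box plot can be sketched without any plotting library. The example below assumes the common convention that whiskers extend to the most extreme points within 1.5 × IQR of the box, and it uses the "inclusive" quartile method from the standard library; other tools may use slightly different quartile definitions. The data set is hypothetical.

```python
from statistics import quantiles

data = [2, 4, 4, 5, 6, 7, 8, 9, 30]  # hypothetical sample with one extreme value

# Quartiles: Q1, median (Q2), and Q3.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences under the assumed 1.5 * IQR convention.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme points that are not outliers.
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_min, whisker_max = min(inside), max(inside)
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Min (excl. outliers): {whisker_min}, Q1: {q1}, Median: {q2}, Q3: {q3}, "
      f"Max (excl. outliers): {whisker_max}")
print(f"Outliers: {outliers}")
```

Here the median sits in the middle of the box, while the single point far above the upper whisker is drawn separately as an outlier.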

**Q5) Discuss the role of random sampling in making inferences about populations.**

-> In simple random sampling, every individual in a population has an equal chance of selection, so the sample tends to be representative of the population. This reduces selection bias, supports statistical analysis (such as estimating sampling error), and allows results to be generalized to the entire population with a known degree of uncertainty.

Example: Selecting 200 students randomly to analyze school performance.
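
The idea can be sketched with a simulation. Everything here is assumed for illustration: a population of 2000 students whose scores are generated artificially, and a fixed seed so that the run is reproducible.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed only to make the sketch reproducible

# Hypothetical population: 2000 student scores (artificially generated).
population_scores = [rng.gauss(70, 10) for _ in range(2000)]

# Simple random sample of 200 students, drawn without replacement.
sample_scores = rng.sample(population_scores, k=200)

population_mean = mean(population_scores)
sample_mean = mean(sample_scores)

print(f"Population mean: {population_mean:.2f}")
print(f"Sample mean (n=200): {sample_mean:.2f}")
```

The sample mean will generally land close to the population mean, which is what justifies using the sample to make inferences about all students.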

**Q6) Explain the concept of skewness and its types. How does skewness affect the interpretation of data?**

-> **Skewness** measures the asymmetry of a data distribution. It indicates whether the data is concentrated more on one side of the mean.

**Types of Skewness**:
1. **Positive Skew (Right-Skewed)**:  
   * Tail is longer on the right.  
   * Typically, Mean > Median > Mode.  
   * Indicates a concentration of lower values with a few higher outliers.

2. **Negative Skew (Left-Skewed)**:  
   * Tail is longer on the left.  
   * Typically, Mean < Median < Mode.  
   * Indicates a concentration of higher values with a few lower outliers.

3. **Symmetrical (Zero Skewness)**:  
   * The data is evenly distributed around the mean.  
   * Mean ≈ Median ≈ Mode.

**Effect on Interpretation**:
* **Central Tendency**: Skewness affects the relationship between the mean, median, and mode.  
* **Outliers**: Skewed data often indicates the presence of outliers, influencing data analysis.  
* **Statistical Models**: Many models assume normality, so skewness may require data transformation or alternative techniques.  
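
A minimal sketch of how skewness can be quantified, using the moment coefficient of skewness (third central moment divided by the 1.5 power of the second). The function name `skewness` and the three data sets are assumptions made for this illustration; statistical libraries may apply additional bias corrections.

```python
from statistics import mean, median


def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5, using population moments."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((x - mu) ** 2 for x in values) / n
    m3 = sum((x - mu) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 20]   # long tail on the right
left_skewed = [-x for x in right_skewed]       # mirror image: long tail on the left
symmetric = [1, 2, 3, 4, 5]

for name, values in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(f"{name}: skewness={skewness(values):.3f}, "
          f"mean={mean(values):.2f}, median={median(values)}")
```

The sign of the coefficient matches the direction of the tail, and the mean is pulled toward the tail relative to the median.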


**Q7) What is the interquartile range (IQR), and how is it used to detect outliers?**

-> The interquartile range (IQR) is a measure of statistical dispersion that represents the range within which the central 50% of a dataset lies. It is calculated as:

$\text{IQR} = Q3 - Q1$

Where:  
- **Q1**: The first quartile (25th percentile).  
- **Q3**: The third quartile (75th percentile).

IQR is robust to extreme values, making it ideal for detecting outliers in skewed data or data with heavy tails.

**Using IQR to Detect Outliers**:

Outliers are values that fall significantly outside the typical range of the data. They can be identified using the following rule:

- **Lower Bound**:  Q1 - 1.5 * IQR  
- **Upper Bound**:  Q3 + 1.5 * IQR

Any data point below the lower bound or above the upper bound is considered an outlier.
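
The rule above can be sketched as a small helper. The function name `iqr_outliers`, its parameter `k`, and the data set are hypothetical. Quartiles are computed with the "inclusive" method of the standard library, and other quartile conventions can give slightly different bounds.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return (lower_bound, upper_bound, outliers) using the k * IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    outliers = [x for x in values if x < lower_bound or x > upper_bound]
    return lower_bound, upper_bound, outliers


data = [10, 12, 13, 14, 15, 16, 18, 19, 50]  # hypothetical measurements
lower_bound, upper_bound, outliers = iqr_outliers(data)

print(f"Lower bound: {lower_bound}, upper bound: {upper_bound}")
print(f"Outliers: {outliers}")
```

For this data set, Q1 is 13 and Q3 is 18, so the IQR is 5 and only the value 50 falls outside the bounds.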

**Q8) Discuss the conditions under which the binomial distribution is used.**

-> The binomial distribution gives the probability of observing a specific number of successes in a fixed number of trials. It is used under the following conditions:

1. **Fixed Number of Trials (n)**:  
   The process consists of a fixed number of trials, decided in advance.

2. **Binary Outcomes**:  
   Each trial has only two possible outcomes, typically labeled as "success" and "failure."

3. **Constant Probability (p)**:  
   The probability of success ($p$) remains constant for each trial.

4. **Independent Trials**:  
   The outcome of one trial does not affect the outcome of others.

5. **Discrete Random Variable**:  
   The distribution models the number of successes ($X$) in $n$ trials, where $X$ is a discrete random variable.

**Examples of Use**:
* Flipping a coin multiple times (heads vs. tails).  
* Testing a batch of items (defective vs. non-defective).  
* Surveying a population for a yes/no response.  
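
When these conditions hold, the probability of exactly $k$ successes is $P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$. The sketch below applies this to the coin-flip example; the helper name `binomial_pmf` and the choice of 10 flips of a fair coin are assumptions for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# Assumed scenario: 10 flips of a fair coin, probability of exactly 5 heads.
n, p = 10, 0.5
p_five_heads = binomial_pmf(5, n, p)

# The probabilities over all possible counts (0 to n) sum to 1.
total_probability = sum(binomial_pmf(k, n, p) for k in range(n + 1))

print(f"P(X = 5) = {p_five_heads:.4f}")
print(f"Sum over all k: {total_probability:.4f}")
```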

**Q9) Explain the properties of the normal distribution and the empirical rule (68-95-99.7 rule).**

-> **Properties of the Normal Distribution**:
1. **Bell-Shaped Curve**:  
   The normal distribution is symmetric and centered around the mean.

2. **Mean, Median, and Mode are Equal**:  
   These measures of central tendency are identical and located at the center of the distribution.

3. **Symmetry**:  
   The left and right halves of the curve are mirror images.

4. **Defined by Two Parameters**:  
   * **Mean ($\mu$)**: Determines the center.  
   * **Standard Deviation ($\sigma$)**: Measures the spread.

5. **Asymptotic**:  
   The tails extend infinitely without touching the horizontal axis.

6. **Area Under the Curve Equals 1**:  
   The total probability for all values in the distribution is 100%.

**Empirical Rule (68-95-99.7 Rule)**:
The empirical rule applies to normal distributions and describes approximately how much of the data lies within a given number of standard deviations (σ) of the mean (μ):

1. **About 68% of Data**: Falls within 1 standard deviation ($\mu \pm 1\sigma$).
2. **About 95% of Data**: Falls within 2 standard deviations ($\mu \pm 2\sigma$).
3. **About 99.7% of Data**: Falls within 3 standard deviations ($\mu \pm 3\sigma$).
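
These percentages can be checked numerically with the standard library's `NormalDist` class. This is a small sketch using the standard normal distribution (mean 0, standard deviation 1); the dictionary name `coverage` is just an illustrative choice.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Probability of falling within k standard deviations of the mean, for k = 1, 2, 3.
coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, share in coverage.items():
    print(f"Within {k} standard deviation(s): {share:.2%}")
```

The same proportions hold for any normal distribution, because the rule is stated in units of standard deviations from the mean.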

**Q10) Provide a real-life example of a Poisson process and calculate the probability for a specific event.**

-> **Real-Life Example of a Poisson Process**

A **Poisson process** models random events occurring independently at a constant average rate. Suppose a call center receives an average of 5 customer calls per hour. This is a classic Poisson process because:

* Calls occur randomly and independently.
* The average rate of calls (5 per hour) is constant.
* Only one call can occur at any exact moment.

We want the probability of receiving exactly 3 calls in an hour. With rate $\lambda = 5$ and $k = 3$, the Poisson formula $P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$ gives:

$P(X = 3) = \frac{5^3 e^{-5}}{3!} \approx 0.1404$

So there is approximately a **14.04% chance** (probability 0.1404) of receiving exactly 3 calls in an hour.

```python
import math

# Parameters
lmbda = 5  # Average rate (calls per hour)
k = 3      # Number of events (calls)

# Poisson probability formula
P = (lmbda**k * math.exp(-lmbda)) / math.factorial(k)
print(f"Probability of exactly {k} calls: {P:.4f}")
```

```
Probability of exactly 3 calls: 0.1404
```


**Q11) Explain what a random variable is and differentiate between discrete and continuous random variables.**

-> A **random variable** is a function that assigns a numerical value to each outcome of a random experiment. It can represent the result of rolling dice, measuring temperature, or counting arrivals at a store.

**Types of Random Variables**:
1. **Discrete Random Variable**:
   - Takes on **countable** values (e.g., 0, 1, 2, ...).  
   - Example: Number of heads in 10 coin tosses.  
   - Distribution: Represented by a probability mass function (PMF).

2. **Continuous Random Variable**:
   - Takes on **any value** within a continuous range (uncountably many possible values).  
   - Example: Height of students in a class.  
   - Distribution: Represented by a probability density function (PDF).

**Key Difference**:
- **Discrete**: Countable outcomes (finite or infinite).  
- **Continuous**: Measurable outcomes within intervals.  
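
The practical difference shows up in how probabilities are computed. The sketch below is illustrative only: it assumes a fair six-sided die for the discrete case and, for the continuous case, heights that follow a normal distribution with a hypothetical mean of 170 cm and standard deviation of 10 cm.

```python
from fractions import Fraction
from statistics import NormalDist

# Discrete: a PMF assigns a probability to each countable value.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}
p_even = sum(prob for face, prob in die_pmf.items() if face % 2 == 0)
pmf_total = sum(die_pmf.values())

# Continuous: probabilities come from areas under the PDF, i.e. intervals.
heights = NormalDist(mu=170, sigma=10)  # hypothetical height distribution (cm)
p_between_160_and_180 = heights.cdf(180) - heights.cdf(160)

print(f"P(even roll) = {p_even}, PMF sums to {pmf_total}")
print(f"P(160 <= height <= 180) = {p_between_160_and_180:.4f}")
```

For a continuous variable the probability of any single exact value is zero, so only intervals carry probability.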

**Q12) Provide an example dataset, calculate both covariance and correlation, and interpret the results.**

-> Example of calculating covariance and correlation:

**Example Dataset**:
- $X = [2, 4, 6, 8, 10]$  (Study Hours)
- $Y = [50, 60, 65, 70, 80]$  (Exam Scores)

**Covariance**  
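With $\bar{x} = 6$ and $\bar{y} = 65$, the sample covariance (dividing by $n - 1$) is calculated as: $\text{Cov}(X, Y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{140}{4} = 35$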

A positive covariance indicates that as study hours increase, exam scores also tend to increase.

**Correlation**  
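The Pearson correlation coefficient is: $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} = \frac{140}{\sqrt{40 \times 500}} \approx 0.99$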

This high correlation suggests a very strong positive linear relationship between study hours and exam scores.


```python
import numpy as np

# Dataset
X = np.array([2, 4, 6, 8, 10])  # Study Hours
Y = np.array([50, 60, 65, 70, 80])  # Exam Scores

# Calculate sample covariance (np.cov divides by n - 1 by default)
covariance = np.cov(X, Y)[0, 1]

# Calculate Pearson correlation
correlation = np.corrcoef(X, Y)[0, 1]

# Output the results
print(f"Covariance: {covariance}")
print(f"Correlation: {correlation}")
```

```
Covariance: 35.0
Correlation: 0.9899494936611665
```
