# Statistics Advance-2

## Q1: What are the Probability Mass Function (PMF) and Probability Density Function (PDF)? Explain with an example.

The Probability Mass Function (PMF) and Probability Density Function (PDF) are mathematical functions that describe how probability is distributed over the values of a discrete or continuous random variable, respectively.

1. **Probability Mass Function (PMF)**:

    - The PMF is used for discrete random variables, which take on distinct, separate values.
    - It gives the probability of a random variable taking on a specific value.
    - The PMF is defined for each possible value of the random variable.
    - The sum of probabilities over all possible values must equal 1.

2. **Probability Density Function (PDF)**:

    - The PDF is used for continuous random variables, which can take on any value within a range.
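    - It gives the probability density (relative likelihood) around each value, not a probability; the probability of any single exact value is 0.
    - Probabilities are defined for intervals or ranges of values, not specific values, and are obtained by integrating the PDF.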
    - The area under the PDF curve over a given interval represents the probability of the random variable falling within that interval.


Here's an example of both a PMF and a PDF:

Example: Rolling a Fair Six-Sided Die (PMF) and Heights in a Population (PDF)

1. **PMF (Discrete)**:

    - The outcome of rolling a six-sided die is a discrete random variable, as it can take on the values {1, 2, 3, 4, 5, 6}.

    - The PMF for this die assigns equal probabilities to each of these values, as it's a fair die. So, the PMF is defined as:

        - P(X = 1) = 1/6
        - P(X = 2) = 1/6
        - P(X = 3) = 1/6
        - P(X = 4) = 1/6
        - P(X = 5) = 1/6
        - P(X = 6) = 1/6

    - The sum of all these probabilities equals 1.

2. **PDF (Continuous)**:

    - Consider a continuous random variable, such as the height of individuals in a population.
    - The height can take on any real value within a certain range.
    - The PDF for this continuous random variable would describe the likelihood of a person's height falling within a specific range. For example, it might look like a bell-shaped curve, similar to the normal distribution.
    - If you wanted to find the probability of a person's height falling between 160 cm and 170 cm, you would calculate the area under the PDF curve over that interval.
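
As a rough sketch of the two examples above, the snippet below builds the die's PMF and computes an interval probability for height. The height model (mean 165 cm, standard deviation 7 cm) is an assumption chosen only for illustration, not a figure from real data.

```python
from fractions import Fraction
from statistics import NormalDist

# PMF of a fair six-sided die: each face has probability 1/6.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}
pmf_total = sum(die_pmf.values())  # a valid PMF sums to exactly 1

# Hypothetical height model (assumed values, not real data):
# heights ~ Normal(mean = 165 cm, sd = 7 cm).
height = NormalDist(mu=165, sigma=7)

# P(160 <= X <= 170) is the area under the PDF over that interval,
# computed here as a difference of CDF values.
p_160_170 = height.cdf(170) - height.cdf(160)

# The PDF value at a single point is a density, not a probability.
density_at_165 = height.pdf(165)

print(pmf_total)
print(round(p_160_170, 4))
print(round(density_at_165, 4))
```

Under these assumed parameters roughly half of the population falls between 160 cm and 170 cm, while the density at exactly 165 cm is only a height on the curve and has no direct probability meaning.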

## Q2: What is the Cumulative Distribution Function (CDF)? Explain with an example. Why is the CDF used?

The **Cumulative Distribution Function (CDF)** is a mathematical function that gives the probability that a random variable takes on a value less than or equal to a specific value. The CDF is defined for both discrete and continuous random variables.

The CDF is denoted as F(x) for a given value x and is defined as follows:

1. For a discrete random variable:
    - F(x) = P(X ≤ x), where X is the random variable.

2. For a continuous random variable:
    - F(x) = P(X ≤ x) = ∫[-∞, x] f(t) dt, where f(t) is the Probability Density Function (PDF) of the continuous random variable, and the integral runs from -∞ (in practice, from the lower end of the variable's range) up to the value x.


Here's an example to illustrate the CDF:

**Example: Rolling a Six-Sided Die (Discrete Random Variable)**

Let's say you want to find the CDF for rolling a fair six-sided die. The die can take values {1, 2, 3, 4, 5, 6} with equal probability. The CDF would look like this:

- F(1) = P(X ≤ 1) = 1/6 (the probability of getting 1 or less)
- F(2) = P(X ≤ 2) = 2/6 (the probability of getting 2 or less)
- F(3) = P(X ≤ 3) = 3/6 (the probability of getting 3 or less)
- F(4) = P(X ≤ 4) = 4/6 (the probability of getting 4 or less)
- F(5) = P(X ≤ 5) = 5/6 (the probability of getting 5 or less)
- F(6) = P(X ≤ 6) = 1 (the probability of getting 6 or less, which is certain)


The CDF provides a way to answer questions like, "What is the probability of rolling a number less than or equal to 3 on the die?" In this case, you would look up F(3) to find the answer (which is 3/6 or 1/2).
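
The table above can be reproduced with a short sketch that accumulates the PMF; exact fractions are used so the values match the hand calculation (in reduced form, so 2/6 prints as 1/3).

```python
from fractions import Fraction
from itertools import accumulate

faces = range(1, 7)
pmf = [Fraction(1, 6)] * 6  # fair die: every face has probability 1/6

# The CDF at each face is the running total of the PMF up to that face.
cdf = dict(zip(faces, accumulate(pmf)))

for face, prob in cdf.items():
    print(f"F({face}) = {prob}")
```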

**Why CDF is used:**

The CDF is useful for several reasons:

1. **Evaluating Probability:** It allows you to calculate the probability that a random variable falls within a given range or is less than a specific value.

2. **Describing Distribution Properties:** It helps in understanding the characteristics of a probability distribution, such as where the majority of the data is concentrated and how spread out it is.

3. **Comparison:** It facilitates the comparison of different random variables or distributions by examining their cumulative probabilities.

4. **Quantiles:** CDFs are used to find quantiles, which are points that divide the distribution into specified percentiles (e.g., median, quartiles).

5. **Statistical Testing:** CDFs are used in statistical tests, hypothesis testing, and assessing the fit of data to a particular distribution.
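
Two of these uses, range probabilities and quantiles, can be sketched with Python's `statistics.NormalDist`. The standard normal distribution is used here purely as an illustrative choice.

```python
from statistics import NormalDist

# Standard normal distribution, chosen only as an illustration.
z = NormalDist(mu=0, sigma=1)

# Range probability from the CDF: P(a < X <= b) = F(b) - F(a).
p_between = z.cdf(1) - z.cdf(-1)

# Quantiles invert the CDF: the median is the x for which F(x) = 0.5,
# and the quartiles are the x values for F(x) = 0.25 and F(x) = 0.75.
median = z.inv_cdf(0.5)
q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)

print(round(p_between, 4))
print(median, round(q1, 4), round(q3, 4))
```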

## Q3: What are some examples of situations where the normal distribution might be used as a model? Explain how the parameters of the normal distribution relate to the shape of the distribution.

The normal distribution, also known as the Gaussian distribution, is a widely used probability distribution in statistics and has numerous applications in modeling various real-world situations. The parameters of the normal distribution are the mean (μ) and the standard deviation (σ), and they play a crucial role in defining the shape of the distribution. Here are some examples of situations where the normal distribution might be used as a model:

1. **Height of Individuals:**

The heights of individuals in a large population often follow a normal distribution. The mean height (μ) represents the average height, and the standard deviation (σ) indicates the spread or variability in heights.

2. **Test Scores:**

In educational testing, scores on standardized tests like the SAT or IQ tests are often assumed to be normally distributed. The mean (μ) represents the average score, and the standard deviation (σ) measures the dispersion of scores around the mean.

3. **Errors in Measurements:**

In experimental science, measurement errors often approximate a normal distribution. The mean (μ) is typically considered to be zero (no systematic bias), and the standard deviation (σ) quantifies the precision of the measurements, that is, how widely repeated measurements scatter.

4. **Financial Returns:**

In finance, daily or monthly returns on investments (e.g., stocks) are often assumed to be normally distributed. The mean (μ) represents the expected return, and the standard deviation (σ) measures the volatility or risk associated with the investment.

5. **Quality Control:**

In manufacturing and quality control processes, the distribution of product measurements (e.g., the diameter of manufactured parts) is often approximated by a normal distribution. The mean (μ) represents the target value, and the standard deviation (σ) helps identify the acceptable variation.

6. **Biological Traits:**

Biological traits such as birth weights, blood pressure, and body temperatures in a population can be modeled using a normal distribution. The mean (μ) represents the typical value for the trait, and the standard deviation (σ) represents the variation.


The parameters μ and σ relate to the shape of the normal distribution as follows:

1. **Mean (μ):**

    - The mean determines the center or peak of the normal distribution.
    - Shifting μ to the left or right moves the entire distribution along the horizontal axis.
    - Changing μ does not affect the spread or width of the distribution; it only shifts the distribution left or right.
2. **Standard Deviation (σ):**

    - The standard deviation determines the spread or width of the normal distribution.
    - A larger σ results in a wider and flatter distribution, indicating greater variability.
    - A smaller σ results in a narrower and taller distribution, indicating less variability.
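
A small sketch of these effects, comparing three normal distributions with parameter values picked arbitrarily for illustration. The peak height of a normal PDF is 1 / (σ√(2π)), so doubling σ halves the peak, while changing μ leaves it untouched.

```python
from statistics import NormalDist

base = NormalDist(mu=0, sigma=1)
shifted = NormalDist(mu=3, sigma=1)  # same spread, centre moved to 3
wide = NormalDist(mu=0, sigma=2)     # same centre, doubled spread

# The peak of a normal PDF sits at the mean; its height is 1 / (sigma * sqrt(2 * pi)).
for name, dist in [("base", base), ("shifted", shifted), ("wide", wide)]:
    peak_height = dist.pdf(dist.mean)
    print(f"{name}: peak at x = {dist.mean}, height = {peak_height:.4f}")
```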

## Q4: Explain the importance of Normal Distribution. Give a few real-life examples of Normal Distribution.

The normal distribution, also known as the Gaussian distribution, is of great importance in statistics and data analysis for several reasons:

1. **Widespread Applicability:** The normal distribution is a versatile and widely applicable probability distribution that can model a wide range of real-world phenomena. It is often used as a default distribution for continuous data due to its mathematical properties and simplicity.

2. **Central Limit Theorem:** The normal distribution plays a fundamental role in the Central Limit Theorem. This theorem states that the mean of a large number of independent, identically distributed random variables with finite variance is approximately normally distributed, regardless of the shape of the original population distribution. This property is essential in statistical inference and hypothesis testing.

3. **Statistical Inference:** Many statistical methods and hypothesis tests are based on the assumption that data follow a normal distribution. These methods include t-tests, analysis of variance (ANOVA), and regression analysis. Deviations from normality can affect the validity of these tests.

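4. **Parameter Estimation:** Under a normal model, the maximum likelihood estimates of the population mean and variance have simple closed forms: the sample mean and the sample variance (with divisor n). This makes the normal distribution a convenient tool for estimating population parameters.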

5. **Data Transformation:** When data do not follow a normal distribution, statistical techniques often involve transforming the data to bring them closer to normality. Common transformations include logarithmic, square root, and Box-Cox transformations.

Real-life examples of situations where the normal distribution is observed include:

1. **Human Characteristics:** Traits like height, weight, IQ scores, and blood pressure in large populations tend to follow a normal distribution. For example, heights of adults in a given region are often approximately normally distributed.

2. **Error and Noise:** Measurement errors, experimental noise, and other random factors often follow a normal distribution. This is crucial in scientific research and quality control processes.

3. **Financial Markets:** Daily returns of stocks, indices, and other financial assets are often assumed to be normally distributed, though this assumption is often challenged during extreme market events.

4. **Product Quality:** The measurements of product dimensions in manufacturing processes often approximate a normal distribution, allowing for quality control and specification limits.

5. **Educational Testing:** Scores on standardized tests like the SAT or IQ tests are typically assumed to be normally distributed.

6. **Biological Data:** Biological data such as birth weights, body temperatures, and enzyme activity levels can often be modeled using a normal distribution.

7. **Environmental Data:** Environmental factors like air pollution levels, rainfall amounts, and temperature readings may exhibit a normal distribution.

## Q5: What is Bernoulli Distribution? Give an Example. What is the difference between Bernoulli Distribution and Binomial Distribution?

**Bernoulli Distribution:**
The Bernoulli distribution is a probability distribution that models a random experiment with exactly two possible outcomes: success (usually denoted as 1) and failure (usually denoted as 0). It is named after the Swiss mathematician Jacob Bernoulli. The distribution is characterized by a single parameter, p, which represents the probability of success on a single trial.

The probability mass function (PMF) of the Bernoulli distribution is defined as follows:

- P(X = 1) = p (probability of success)
- P(X = 0) = 1 - p (probability of failure)

**Example of Bernoulli Distribution:**
A classic example of a Bernoulli distribution is a single toss of a fair coin. If we define "success" as getting heads (H) and "failure" as getting tails (T), then the probability of success (p) is 0.5 (since the coin is fair). The outcome of the experiment follows a Bernoulli distribution.

**Difference between Bernoulli Distribution and Binomial Distribution:**
The key differences between the Bernoulli distribution and the Binomial distribution are as follows:

1. **Number of Trials:**

    - Bernoulli Distribution: It models a single trial or experiment with two possible outcomes (success and failure).
    - Binomial Distribution: It models a series of independent and identical Bernoulli trials, where each trial can result in success or failure.

2. **Parameters:**

    - Bernoulli Distribution: It has one parameter, p, which represents the probability of success in a single trial.
    - Binomial Distribution: It has two parameters, n (the number of trials) and p (the probability of success on each trial).

3. **Probability Mass Function (PMF):**

    - Bernoulli Distribution: The PMF assigns probabilities to just two outcomes: P(X = 1) = p for success and P(X = 0) = 1 - p for failure.
    - Binomial Distribution: The PMF defines the probabilities for a specific number of successes (k) in n trials. The PMF is given by the binomial coefficient C(n, k) times p^k times (1 - p)^(n - k).

4. **Use Cases:**

    - Bernoulli Distribution: It is used to model individual, binary events such as the outcome of a single coin toss, the success or failure of a single trial, or the presence or absence of a specific characteristic.
    - Binomial Distribution: It is used to model the number of successes in a fixed number of independent and identical Bernoulli trials. It is applicable in situations where you want to know the probability of obtaining a certain number of successes in a series of trials.
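
The relationship between the two distributions can be sketched in a few lines. The helper name `binomial_pmf` and the choice of 10 coin tosses are illustrative assumptions; the formula is the one stated above, C(n, k) times p^k times (1 - p)^(n - k).

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p): C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# With n = 1 the binomial PMF reduces to the Bernoulli PMF: [P(X=0), P(X=1)].
bernoulli = [binomial_pmf(k, n=1, p=0.5) for k in (0, 1)]

# Number of heads in 10 tosses of a fair coin (illustrative choice of n).
ten_tosses = [binomial_pmf(k, n=10, p=0.5) for k in range(11)]

print(bernoulli)
print(round(ten_tosses[5], 4))   # probability of exactly 5 heads
print(round(sum(ten_tosses), 4)) # the PMF sums to 1
```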

## Q6. Consider a dataset with a mean of 50 and a standard deviation of 10. If we assume that the dataset is normally distributed, what is the probability that a randomly selected observation will be greater than 60? Use the appropriate formula and show your calculations.
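
One way to work this out is to standardize the value and read the upper-tail area of the standard normal distribution. With μ = 50, σ = 10 and x = 60, the z-score is z = (x - μ)/σ = (60 - 50)/10 = 1, so P(X > 60) = P(Z > 1) = 1 - Φ(1) ≈ 1 - 0.8413 = 0.1587, where Φ denotes the standard normal CDF. The sketch below repeats the calculation with Python's `statistics.NormalDist`; the variable names are chosen here for illustration.

```python
from statistics import NormalDist

mu, sigma, x = 50, 10, 60

z = (x - mu) / sigma                 # standardize: (60 - 50) / 10 = 1.0
p_greater = 1 - NormalDist().cdf(z)  # upper-tail area of the standard normal

print(z)                    # 1.0
print(round(p_greater, 4))  # 0.1587
```

So about 15.87% of observations are expected to exceed 60 under the normality assumption.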

## Q7: Explain uniform Distribution with an example.

A uniform distribution, also known as a rectangular distribution, is a probability distribution in statistics that describes a situation where all values within a specific range have an equal probability of occurring. In other words, each possible outcome is equally likely.

The probability density function (PDF) of a continuous uniform distribution is characterized by a constant probability over a specified interval. The formula for the PDF of a continuous uniform distribution over the interval [a, b] is as follows:

> - f(x) = 1 / (b - a) for a ≤ x ≤ b
> - f(x) = 0 otherwise

Here's an example to illustrate a uniform distribution, using the discrete counterpart in which a finite set of outcomes is equally likely:

Example: Rolling a Fair Six-Sided Die

Consider rolling a fair six-sided die. The die has six faces, numbered from 1 to 6. When you roll the die, each outcome (1, 2, 3, 4, 5, or 6) has an equal probability of occurring, assuming the die is not biased.

In this case, the probability of each outcome is 1/6 because there are 6 equally likely outcomes, and the total probability must sum to 1. This is an example of a discrete uniform distribution.

In the context of a uniform distribution, this means that if you were to plot a histogram of the results of rolling the die a large number of times, you would see a flat, uniform shape because each number has the same probability of occurring.
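
The flat histogram can be checked with a small simulation, sketched below with an arbitrary seed and sample size. The helper `uniform_interval_prob` is a hypothetical name; it applies the continuous PDF given earlier, for which the probability of a sub-interval is its length divided by (b - a).

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the run is repeatable

# Discrete uniform: simulate many rolls of a fair die.
rolls = [random.randint(1, 6) for _ in range(60_000)]
counts = Counter(rolls)
relative_freq = {face: counts[face] / len(rolls) for face in range(1, 7)}

def uniform_interval_prob(c: float, d: float, a: float, b: float) -> float:
    """P(c <= X <= d) for X ~ Uniform(a, b): overlap length divided by (b - a)."""
    lo, hi = max(a, c), min(b, d)  # clip the query interval to [a, b]
    return max(0.0, hi - lo) / (b - a)

for face, freq in relative_freq.items():
    print(face, round(freq, 3))  # each should be close to 1/6

print(uniform_interval_prob(2, 5, a=0, b=10))
```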

## Q8: What is the z score? State the importance of the z score.

The z-score, also known as the standard score or standard deviation score, is a statistical measure that quantifies how many standard deviations a data point is away from the mean (average) of a dataset. It is a way to standardize and compare data points from different distributions, allowing for a better understanding of where a specific data point stands relative to the mean.

The formula to calculate the z-score of a data point, denoted as "z," is as follows:

> z = (x-μ)/σ

Where:

- z is the z-score.
- x is the individual data point.
- μ is the mean (average) of the dataset.
- σ is the standard deviation of the dataset.
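
A minimal sketch of the formula applied to a whole dataset. The scores are made up for illustration, and the population standard deviation (`pstdev`) is used on the assumption that the list is the entire dataset rather than a sample.

```python
from statistics import mean, pstdev

# Hypothetical exam scores, invented for this example.
scores = [52, 61, 70, 70, 74, 79, 84]

mu = mean(scores)
sigma = pstdev(scores)  # population standard deviation

# z = (x - mu) / sigma for every data point.
z_scores = [(x - mu) / sigma for x in scores]

for x, z in zip(scores, z_scores):
    print(x, round(z, 2))
```

After standardization the z-scores have mean 0 and standard deviation 1, and a score equal to the mean (70 here) maps to z = 0.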


The importance of the z-score can be summarized as follows:

1. **Standardization:** Z-scores allow for the standardization of data. By expressing data points in terms of standard deviations from the mean, you can compare and analyze data from different datasets that may have different units or scales.

2. **Identifying Outliers:** Z-scores are commonly used to identify outliers in a dataset. Data points whose z-scores are far from 0 (beyond a chosen threshold, e.g., z-scores greater than 2 or less than -2, or a stricter cutoff of 3) may be considered outliers.

3. **Normal Distribution:** Converting normally distributed data to z-scores yields the standard normal distribution (a special case of the normal distribution with a mean of 0 and a standard deviation of 1). Approximately 68% of the data falls within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations. Z-scores help you determine the percentage of data within a given range.

4. **Hypothesis Testing:** Z-scores are often used in hypothesis testing and confidence interval calculations. They help in assessing whether a sample statistic is significantly different from a population parameter.

5. **Data Transformation:** Z-scores are useful in transforming data to have a mean of 0 and a standard deviation of 1. This transformation can be helpful in certain statistical analyses.

6. **Data Analysis:** Z-scores provide a common scale for data, making it easier to compare and interpret values within a dataset.
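
The 68%, 95% and 99.7% figures quoted above can be checked directly from the standard normal CDF, as in this short sketch:

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Share of a normal distribution lying within k standard deviations of the mean.
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in within.items():
    print(f"within {k} sd: {share:.4f}")
```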

## Q9: What is Central Limit Theorem? State the significance of the Central Limit Theorem.

The Central Limit Theorem (CLT) is a fundamental concept in statistics. It states that if you draw sufficiently large random samples from a population with finite variance, the distribution of the sample mean is approximately normal and centered on the population mean, regardless of the shape of the population distribution. This theorem simplifies many problems in statistics by allowing you to work with a distribution that is approximately normal.

The significance of the Central Limit Theorem is vast and it has several applications:
- It is at the heart of hypothesis testing.
- It is used in political/election polling to estimate the number of people who support a specific candidate.
- It is used in census work to estimate population details, such as family income, electricity consumption, individual salaries, and so on.
- It implies that the sample mean converges on the population mean, and that the difference between them, scaled by √n, becomes approximately normally distributed with a variance equal to the population variance as the sample size n increases. Equivalently, the sample mean itself has a variance of about σ²/n.

In essence, the Central Limit Theorem is a crucial pillar of statistics and machine learning.
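
A small simulation can illustrate the theorem. The sketch below assumes an exponential population with rate 1 (mean 1, standard deviation 1), which is clearly non-normal; the sample size, number of repetitions and seed are arbitrary choices.

```python
import random
from statistics import mean, pstdev

random.seed(42)  # fixed seed so the run is repeatable

def sample_mean(n: int) -> float:
    """Mean of n draws from Exponential(rate=1), a skewed, non-normal population."""
    return mean([random.expovariate(1.0) for _ in range(n)])

n = 50
means = [sample_mean(n) for _ in range(5_000)]

# Theory: sample means are centred on 1 with standard error 1 / sqrt(n).
se = 1 / n ** 0.5
share_within = sum(abs(m - 1) <= 1.96 * se for m in means) / len(means)

print(round(mean(means), 3))    # close to the population mean, 1
print(round(pstdev(means), 3))  # close to 1 / sqrt(50)
print(round(share_within, 3))   # close to 0.95 if the means are roughly normal
```

Even though individual draws are strongly skewed, the simulated sample means cluster symmetrically around 1 with a spread near σ/√n, which is the behaviour the CLT predicts.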


## Q10: State the assumptions of the Central Limit Theorem.

The Central Limit Theorem (CLT) is based on the following assumptions:

1. **Randomization**: The samples must be drawn randomly.
2. **Independence**: The samples drawn should be independent of each other, meaning one sample should not influence the others.
3. **Sample Size**: When the sampling is done without replacement, the sample size shouldn’t exceed 10% of the total population. Also, the sample size must be sufficiently large; n > 30 is a common rule of thumb.