# Assignment (9th March) : Statistics Assignments - 2

### Q1: What are the Probability Mass Function (PMF) and Probability Density Function (PDF)? Explain with an example.

**ANS:** The Probability Mass Function (PMF) and Probability Density Function (PDF) are two fundamental concepts in probability theory and statistics, used to describe the distributions of discrete and continuous random variables, respectively.

**`Probability Mass Function (PMF):`** The Probability Mass Function (PMF) is used for discrete random variables. It gives the probability that a discrete random variable is exactly equal to some value. Mathematically, for a discrete random variable \( X \), the PMF is denoted as \( P(X = x) \) or \( p(x) \).

**Example:**
Consider a fair six-sided die. Let \( X \) be the random variable representing the outcome of a single roll of the die. The PMF of \( X \) is:

\[
p(x) =
\begin{cases}
\frac{1}{6} & \text{if } x \in \{1, 2, 3, 4, 5, 6\} \\
0 & \text{otherwise}
\end{cases}
\]

This means each of the six outcomes (1 through 6) has an equal probability of \( 1/6 \).
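The die PMF above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original answer; the function name `die_pmf` and its `sides` parameter are assumptions made for the example.

```python
from fractions import Fraction

def die_pmf(x, sides=6):
    """PMF of a fair die: 1/sides for x in 1..sides, 0 otherwise."""
    if isinstance(x, int) and 1 <= x <= sides:
        return Fraction(1, sides)
    return Fraction(0)

# A valid PMF must sum to 1 over all possible outcomes.
total = sum(die_pmf(x) for x in range(1, 7))

print(die_pmf(3))  # 1/6
print(die_pmf(7))  # 0
print(total)       # 1
```

Exact fractions are used here so that the sum-to-one check is not affected by floating-point rounding.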


**`Probability Density Function (PDF):`** The Probability Density Function (PDF) is used for continuous random variables. It describes the relative likelihood for the random variable to take on a given value. The PDF is denoted as \( f(x) \). Unlike the PMF, the PDF does not give probabilities directly; instead, the area under the PDF curve over an interval gives the probability that the random variable falls within that interval.

**Example:**
Consider a continuous random variable \( X \) that follows a normal distribution with mean \( \mu = 0 \) and standard deviation \( \sigma = 1 \). The PDF of \( X \) is:

\[ f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \]

This is the standard normal distribution. The probability that \( X \) falls within a particular range \([a, b]\) is given by the area under the curve of \( f(x) \) from \( a \) to \( b \):


\[ P(a \le X \le b) = \int_a^b \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \, dx \]
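As a rough check of the "area under the curve" idea, the sketch below integrates the standard normal PDF numerically and compares the result with the difference of two CDF values. It relies on `statistics.NormalDist` from the Python standard library; the helper `area_under_pdf` and the interval \([-1, 1]\) are assumptions chosen for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def area_under_pdf(dist, a, b, steps=10_000):
    """Approximate the integral of dist.pdf over [a, b] with the trapezoid rule."""
    h = (b - a) / steps
    inner = sum(dist.pdf(a + i * h) for i in range(1, steps))
    return h * (0.5 * dist.pdf(a) + inner + 0.5 * dist.pdf(b))

a, b = -1.0, 1.0
numeric = area_under_pdf(std_normal, a, b)
exact = std_normal.cdf(b) - std_normal.cdf(a)

print(round(std_normal.pdf(0), 4))         # density at 0, about 0.3989
print(round(numeric, 4), round(exact, 4))  # both about 0.6827
```

Note that `pdf(0)` is a density, not a probability; only the integrated area is a probability.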
 

### Q2: What is the Cumulative Distribution Function (CDF)? Explain with an example. Why is the CDF used?

**ANS:** The Cumulative Distribution Function (CDF) is a fundamental concept in probability theory that describes the probability that a random variable takes on a value less than or equal to a specific value. It is applicable to both discrete and continuous random variables. The CDF provides a comprehensive way to describe the distribution of a random variable.

Mathematically, for a random variable \( X \), the CDF, denoted as \( F(x) \), is defined as:

\[ F(x) = P(X \le x) \]

This means that \( F(x) \) gives the probability that the random variable \( X \) is less than or equal to \( x \).
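As an example, the CDF of a fair six-sided die can be built by accumulating its PMF. The snippet below is a minimal sketch; the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import accumulate

outcomes = [1, 2, 3, 4, 5, 6]
pmf = [Fraction(1, 6)] * 6

# Running totals of the PMF give the CDF at each outcome.
cdf = dict(zip(outcomes, accumulate(pmf)))

print(cdf[4])  # P(X <= 4) = 2/3
print(cdf[6])  # P(X <= 6) = 1
```

The CDF is non-decreasing and reaches 1 at the largest possible outcome.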



### Why CDF is Used:

1. **`Complete Distribution Description`**: The CDF provides a complete description of the distribution of a random variable, including all probabilities associated with it.
2. **`Probability Calculation`**: It is used to compute the probability that a random variable falls within a certain range. For example, \( P(a < X \le b) = F(b) - F(a) \); for a continuous random variable this also equals \( P(a \le X \le b) \).
3. **`Quantile Calculation`**: The CDF can be used to find quantiles, which are points in the distribution that correspond to specific probabilities.
4. **`Comparison of Distributions`**: CDFs are useful for comparing different distributions. The CDF plots of different distributions can be compared to see which values are more likely in one distribution than another.
5. **`CDF Inversion`**: The inverse of the CDF is used in sampling from a distribution. Given a probability, the inverse CDF gives the corresponding value of the random variable.
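The range-probability and quantile uses listed above can be sketched with `statistics.NormalDist`, which provides both `cdf` and `inv_cdf`. The standard normal distribution and the cut-off 1.96 are assumptions chosen for illustration.

```python
from statistics import NormalDist

dist = NormalDist(mu=0, sigma=1)

# Range probability from two CDF values.
p_range = dist.cdf(1.96) - dist.cdf(-1.96)

# Quantile: the value below which 97.5% of the probability lies.
q975 = dist.inv_cdf(0.975)

print(round(p_range, 3))  # about 0.95
print(round(q975, 2))     # about 1.96
```

Inverse-CDF sampling uses the same `inv_cdf` function, applied to uniform random numbers strictly between 0 and 1.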



### Q3: What are some examples of situations where the normal distribution might be used as a model? Explain how the parameters of the normal distribution relate to the shape of the distribution.

**ANS:** The normal distribution is characterized by two parameters: 

1. **Mean**: The mean is the central location parameter. It determines the center of the distribution. In a normal distribution, the mean, median, and mode are all equal and located at \( \mu \).

2. **Standard Deviation**: The standard deviation is a measure of the spread or dispersion of the distribution. It determines the width of the bell curve. A smaller \( \sigma \) results in a steeper and narrower curve, while a larger \( \sigma \) results in a flatter and wider curve.

### Relationship Between Parameters and the Shape of the Distribution

1. **Changing the Mean \( \mu \)**:
   - Shifts the entire distribution to the left or right along the horizontal axis.
   - The shape of the distribution does not change; only the center changes.
   - For example, if \( \mu \) increases, the peak of the distribution moves to the right.

2. **Changing the Standard Deviation \( \sigma \)**:
   - Affects the spread of the distribution.
   - A larger \( \sigma \) spreads the distribution out more, making it flatter and wider.
   - A smaller \( \sigma \) makes the distribution steeper and narrower.
   - The area under the curve remains the same (equal to 1), as it represents the total probability.
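One way to see these effects numerically is to look at the height of the peak, which is the density at the mean and equals \( 1 / (\sigma \sqrt{2\pi}) \). The sketch below is illustrative only; the parameter values are arbitrary assumptions.

```python
from statistics import NormalDist

# Peak height of the bell curve is the density evaluated at the mean.
for mu, sigma in [(0, 1), (5, 1), (0, 2)]:
    dist = NormalDist(mu, sigma)
    print(mu, sigma, round(dist.pdf(mu), 4))

narrow_peak = NormalDist(0, 1).pdf(0)
shifted_peak = NormalDist(5, 1).pdf(5)
wide_peak = NormalDist(0, 2).pdf(0)
```

Shifting the mean from 0 to 5 leaves the peak height unchanged, while doubling the standard deviation halves it, which is what a flatter and wider curve with the same total area requires.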

### Examples of Situations Where the Normal Distribution Might Be Used

The normal distribution, also known as the Gaussian distribution, is widely used as a model in various fields due to its properties and the Central Limit Theorem. Here are some examples:

1. **Height and Weight of Individuals**: The heights and weights of people in a population often follow a normal distribution, especially when considering a large sample size.

2. **Test Scores**: Scores on standardized tests (e.g., SAT, IQ tests) are often modeled using a normal distribution.

3. **Measurement Errors**: Measurement errors in scientific experiments tend to follow a normal distribution due to the aggregation of many small, independent errors.

4. **Stock Prices and Financial Returns**: While stock prices themselves may not follow a normal distribution, the logarithm of stock prices or the returns over a short period can often be approximated by a normal distribution.

5. **Quality Control**: In manufacturing, the variation in dimensions of machine parts often follows a normal distribution. This is used in quality control processes to determine acceptable ranges.

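6. **Biological Measurements**: Many biological measurements, such as blood pressure and enzyme activity, are often approximately normally distributed. Some, such as reaction times, tend to be right-skewed and may need a transformation before a normal model fits well.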

### Q4: Explain the importance of Normal Distribution. Give a few real-life examples of Normal Distribution.

**ANS:** The normal distribution is critically important in statistics and many real-world applications for several reasons:

1. **`Central Limit Theorem (CLT)`**: The CLT states that the suitably standardized sum (or average) of a large number of independent, identically distributed random variables with finite variance tends toward a normal distribution, regardless of the original distribution of the variables. This makes the normal distribution a key tool for inference and hypothesis testing.

2. **`Mathematical Properties`**: The normal distribution has convenient mathematical properties that facilitate analytical calculations. Many statistical methods and tests assume normality due to its simplicity and tractability.

3. **`Descriptive Statistics`**: Many real-world phenomena naturally follow a normal distribution, making it a good model for many types of data. This allows for effective summarization and analysis of data.

4. **`Probabilistic Interpretation`**: The normal distribution provides a clear probabilistic framework for predicting outcomes and understanding variability in data.

5. **`Standardization`**: Data can be transformed into a standard normal distribution (mean = 0, standard deviation = 1) using z-scores, allowing for comparison across different datasets and variables.



#### Real-Life Examples of Normal Distribution

1. **`Human Heights`**:
   - **Example**: The heights of adult men and women in a population tend to follow a normal distribution.
   - **Importance**: This allows for the prediction of height-related statistics and informs areas such as clothing manufacturing and ergonomic design.

2. **`IQ Scores`**:
   - **Example**: IQ scores are designed to follow a normal distribution with a mean of 100 and a standard deviation of 15.
   - **Importance**: This distribution allows for the assessment of cognitive abilities across a population and the identification of individuals with exceptionally high or low scores.

3. **`Measurement Errors`**:
   - **Example**: In scientific experiments, the errors in measurements due to instrument precision often follow a normal distribution.
   - **Importance**: Understanding the distribution of measurement errors helps in estimating the accuracy and reliability of experimental results.



### Q5: What is the Bernoulli Distribution? Give an example. What is the difference between the Bernoulli Distribution and the Binomial Distribution?

**ANS:** The Bernoulli distribution is a discrete probability distribution for a random variable that has `only two possible outcomes`: success (usually coded as 1) and failure (usually coded as 0). It is the simplest type of discrete distribution and is used to model binary outcomes.

- **`Example:`**
Consider flipping a fair coin. Let \( X \) be the random variable representing the outcome, where heads (success) is coded as 1 and tails (failure) is coded as 0. The Bernoulli distribution for this scenario has \( p = 0.5 \).

**`Differences Between Bernoulli and Binomial Distributions:`**

1. **Number of Trials**:
   - **Bernoulli Distribution**: Models a single trial with two possible outcomes (success or failure).
   - **Binomial Distribution**: Models the number of successes in \( n \) independent Bernoulli trials.


2. **Parameters**:
   - **Bernoulli Distribution**: Has a single parameter \( p \), which is the probability of success.
   - **Binomial Distribution**: Has two parameters \( n \) (number of trials) and \( p \) (probability of success in each trial).


3. **Support (Possible Values)**:
   - **Bernoulli Distribution**: The random variable can take only two values: 0 (failure) and 1 (success).
   - **Binomial Distribution**: The random variable can take any integer value from 0 to \( n \), representing the number of successes in \( n \) trials.


4. **Application**:
   - **Bernoulli Distribution**: Used for single trials or binary outcomes (e.g., flipping a coin once).
   - **Binomial Distribution**: Used for multiple trials to count the number of successes (e.g., number of heads in 10 coin flips).
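The relationship between the two distributions can be sketched in code: the binomial PMF with \( n = 1 \) reduces to the Bernoulli PMF. The function name `binomial_pmf` is an assumption for this example; `math.comb` is from the standard library.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent Bernoulli(p) trials)."""
    if not 0 <= k <= n:
        return 0.0
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Bernoulli is the n = 1 special case.
print(binomial_pmf(1, 1, 0.5))             # 0.5

# Probability of exactly 5 heads in 10 fair coin flips.
print(round(binomial_pmf(5, 10, 0.5), 4))  # 0.2461
```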



### Q6. Consider a dataset with a mean of 50 and a standard deviation of 10. If we assume that the dataset is normally distributed, what is the probability that a randomly selected observation will be greater than 60? Use the appropriate formula and show your calculations.

**ANS:** To find the probability that a randomly selected observation from a normally distributed dataset with a mean of 50 and a standard deviation of 10 is greater than 60, we can use the standard normal distribution (z-score) and standard normal distribution tables (or a calculator). The steps are as follows:

1. **`Calculate the z-score`**:
   The z-score is a measure of how many standard deviations an element is from the mean. The formula for the z-score is:

   \[
   z = \frac{X - \mu}{\sigma}
   \]

   Where:
   - \( X \) is the value of interest (60 in this case).
   - \( \mu \) is the mean of the distribution (50).
   - \( \sigma \) is the standard deviation of the distribution (10).

   Substituting the values:

   \[
   z = \frac{60 - 50}{10} = \frac{10}{10} = 1
   \]

2. **`Find the probability corresponding to the z-score`**:
   We need to find \( P(Z > 1) \), where \( Z \) is a standard normal random variable.

   Using standard normal distribution tables or a calculator, we find the cumulative probability for \( Z \le 1 \).

   \[
   P(Z \le 1) \approx 0.8413
   \]

   This value represents the probability that a z-score is less than or equal to 1.

3. **`Calculate the probability of (Z > 1)`**:
   Since the total area under the standard normal distribution curve is 1, the probability that \( Z \) is greater than 1 is:

   \[
   P(Z > 1) = 1 - P(Z \le 1)
   \]

   Substituting the value from the table:

   \[
   P(Z > 1) = 1 - 0.8413 = 0.1587
   \]




The probability that a randomly selected observation from this normally distributed dataset will be greater than 60 is approximately 0.1587, or 15.87%.
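The table lookup can be cross-checked in Python. This is a small sketch using `statistics.NormalDist` in place of a printed z-table; the variable names are illustrative.

```python
from statistics import NormalDist

dist = NormalDist(mu=50, sigma=10)

# z-score of the value of interest.
z = (60 - dist.mean) / dist.stdev

# P(X > 60) is the complement of the CDF at 60.
p_greater = 1 - dist.cdf(60)

print(z)                    # 1.0
print(round(p_greater, 4))  # 0.1587
```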




### Q7: Explain uniform Distribution with an example.

**ANS:** The uniform distribution is a type of probability distribution in which all outcomes are equally likely. Each value in a finite range is equally probable, and the distribution can be either discrete or continuous.

1. **`Discrete Uniform Distribution:`** A discrete uniform distribution is defined over a set of \( n \) distinct outcomes, each of which has an equal probability of \( 1/n \).

**Example:**
Consider rolling a fair six-sided die. The possible outcomes are {1, 2, 3, 4, 5, 6}, and each outcome has an equal probability of \( 1/6 \).


2. **`Continuous Uniform Distribution:`** A continuous uniform distribution is defined over a continuous range \( [a, b] \), where every value within this range is equally likely.

**Example:**
Consider a random variable \( X \) that represents the time (in minutes) a person waits for a bus that arrives at a bus stop every hour. If the person shows up at a random moment, the waiting time is uniformly distributed between 0 and 60 minutes.
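For the bus example, the density is constant at \( 1/(b - a) = 1/60 \) on the interval, and probabilities are proportional to interval length. The sketch below illustrates this; the function names `uniform_pdf` and `uniform_cdf` are assumptions for the example.

```python
def uniform_pdf(x, a, b):
    """Density of a continuous uniform distribution on [a, b]."""
    return 1 / (b - a) if a <= x <= b else 0.0

def uniform_cdf(x, a, b):
    """P(X <= x) for a continuous uniform distribution on [a, b]."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

# Bus example: waiting time is uniform on [0, 60] minutes.
p_at_most_15 = uniform_cdf(15, 0, 60)
p_between_10_and_30 = uniform_cdf(30, 0, 60) - uniform_cdf(10, 0, 60)

print(p_at_most_15)  # 0.25
print(p_between_10_and_30)
```

A wait of between 10 and 30 minutes covers 20 of the 60 minutes, so its probability is one third.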




### Q8: What is the z score? State the importance of the z score.

**ANS:** The z score, also known as the standard score, measures how many standard deviations an individual data point is from the mean of a dataset. It is a way to standardize data points within different distributions, allowing for comparison across different scales.

The formula for calculating the z score of a data point \( x \) is:

\[ z = \frac{x - \mu}{\sigma} \]

Where:
- \( x \) is the value of the data point.
- \( \mu \) is the mean of the dataset.
- \( \sigma \) is the standard deviation of the dataset.

**`Importance of the Z Score:`**

1. **Standardization**: The z score standardizes data points, allowing for comparison between different datasets or distributions. This is especially useful when dealing with data measured on different scales or units.

2. **Identifying Outliers**: Z scores help in identifying outliers in the data. Data points with a z score greater than 3 or less than -3 are often considered outliers, as they lie far from the mean.

3. **Normal Distribution**: In a normal distribution, z scores provide a way to determine the probability of a data point occurring within a certain range. This is useful in statistical analysis and hypothesis testing.

4. **Comparing Scores Across Different Distributions**: Z scores allow for comparison of scores from different distributions by converting them to a common scale. For example, comparing SAT scores and ACT scores, which are on different scales, can be done using z scores.

5. **Probabilistic Interpretation**: Z scores are used to calculate probabilities and percentiles in a standard normal distribution. This is helpful in various applications, such as quality control, finance, and social sciences.

6. **Hypothesis Testing**: In hypothesis testing, z scores are used to determine how far away a sample mean is from the population mean under the null hypothesis. This helps in making decisions about the validity of the null hypothesis.
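The standardization step can be sketched as follows. The dataset is hypothetical, and the sketch assumes the population standard deviation (`statistics.pstdev`) is the appropriate choice; with a sample, `statistics.stdev` may be preferred.

```python
from statistics import mean, pstdev

def z_scores(data):
    """Standardize each value using the dataset's mean and population standard deviation."""
    mu = mean(data)
    sigma = pstdev(data)
    return [(x - mu) / sigma for x in data]

scores = [40, 50, 50, 60, 100]  # hypothetical data
zs = z_scores(scores)

print([round(z, 2) for z in zs])
```

After standardization the values have mean 0 and standard deviation 1, which is what makes scores from different scales comparable.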




### Q9: What is Central Limit Theorem? State the significance of the Central Limit Theorem.

**ANS:** The Central Limit Theorem (CLT) is a fundamental theorem in statistics that describes the characteristics of the sampling distribution of the sample mean. The theorem states that, given a sufficiently large sample size, the sampling distribution of the sample mean will be approximately normally distributed, regardless of the original distribution of the population from which the sample is drawn. This holds true as long as the population has a finite variance.



**`Significance of the Central Limit Theorem:`**

1. **Foundation for Inferential Statistics**: The CLT provides the theoretical foundation for making inferences about population parameters using sample statistics. It allows us to use sample data to estimate population parameters and make predictions.

2. **Approximation of Distribution**: The CLT justifies the use of the normal distribution as an approximation for the distribution of the sample mean, even when the underlying population distribution is not normal. This is particularly useful when dealing with non-normally distributed data.

3. **Simplification of Analysis**: Because the sample mean approximately follows a normal distribution for large sample sizes, many statistical procedures and tests (such as confidence intervals and hypothesis tests) that assume normality can be applied, simplifying the analysis.

4. **Applications in Real-World Problems**: The CLT is widely used in various fields, such as economics, engineering, social sciences, and natural sciences, where it helps in analyzing and interpreting data.

5. **Law of Large Numbers**: The CLT complements the Law of Large Numbers, which states that as the sample size increases, the sample mean converges to the population mean. Together, these theorems provide a robust framework for understanding the behavior of sample statistics.
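A small simulation can illustrate the theorem. The sketch below assumes an exponential population with rate 1 (mean 1, standard deviation 1, strongly right-skewed); the sample size, number of replications, and seed are arbitrary choices for illustration.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is repeatable

n = 50               # size of each sample
replications = 5000  # number of samples drawn

# Each entry is the mean of one sample of n exponential(1) draws.
sample_means = [
    sum(random.expovariate(1.0) for _ in range(n)) / n
    for _ in range(replications)
]

center = mean(sample_means)
spread = stdev(sample_means)

# Fraction of sample means within one standard deviation of the center.
within_one = sum(abs(m - center) <= spread for m in sample_means) / replications

print(round(center, 2), round(spread, 2), round(within_one, 2))
```

Under the CLT the sample means should center near the population mean of 1, with spread near \( \sigma / \sqrt{n} = 1/\sqrt{50} \approx 0.14 \), and roughly 68% of them should fall within one standard deviation of the center, as for a normal distribution.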




### Q10: State the assumptions of the Central Limit Theorem.

**ANS:** The Central Limit Theorem (CLT) makes several key assumptions that must be satisfied for the theorem to hold true. These assumptions are crucial for ensuring that the sampling distribution of the sample mean approximates a normal distribution as the sample size becomes large.

1. **`Independence`**:   The sampled observations must be independent of each other. This means that the outcome of one observation should not affect the outcome of another. In practice, this is often achieved by random sampling.


2. **`Identically Distributed`**:  The sampled observations should be drawn from the same population, meaning they should be identically distributed. This implies that each observation comes from the same probability distribution with the same mean \( \mu \) and variance \( \sigma^2 \).


3. **`Finite Variance`**:  The population from which the sample is drawn must have a finite variance \( \sigma^2 < \infty \). If the variance is infinite, the CLT does not apply.


4. **`Sample Size`**:  The sample size \( n \) should be sufficiently large. While there is no strict rule for what constitutes a "large" sample size, a common guideline is \( n \ge 30 \). However, the required sample size can vary depending on the shape of the underlying population distribution:
     - If the population distribution is approximately normal, even smaller sample sizes can suffice.
     - If the population distribution is heavily skewed or has heavy tails, larger sample sizes are necessary to achieve a normal approximation.
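The effect of sample size on a skewed population can be sketched by measuring the skewness of the sample means, which should shrink toward 0 (the value for a normal distribution) as \( n \) grows. The exponential population, the helper names, and the seed below are assumptions for illustration.

```python
import random
from statistics import mean, pstdev

def skewness(values):
    """Skewness estimate: the average cubed z-score."""
    mu, sigma = mean(values), pstdev(values)
    return mean(((v - mu) / sigma) ** 3 for v in values)

def sample_mean_skew(n, replications=4000, seed=1):
    """Skewness of sample means of size n drawn from an exponential(1) population."""
    rng = random.Random(seed)
    means = [
        sum(rng.expovariate(1.0) for _ in range(n)) / n
        for _ in range(replications)
    ]
    return skewness(means)

skew_small = sample_mean_skew(2)
skew_large = sample_mean_skew(50)

print(round(skew_small, 2), round(skew_large, 2))
```

For this population the theoretical skewness of the sample mean is \( 2/\sqrt{n} \), so samples of size 2 remain clearly skewed while samples of size 50 are much closer to symmetric.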

