# **Statistics Basics**

# **Assignment Questions**

**1. Explain the different types of data (qualitative and quantitative) and provide examples of each. Discuss nominal, ordinal, interval, and ratio scales.**

**Types of Data**


**1. Qualitative Data**


- Definition: Describes attributes, labels, or characteristics that cannot be measured or counted in numbers.


- Key Features:


- Non-numerical in nature.

- Often categorized into groups or classes.

- Examples:

- Nominal Data: Categories without any inherent order.

- Examples: Gender (male, female), types of fruits (apple, banana, orange), marital status (single, married).


- Ordinal Data: Categories with a meaningful order but no uniform difference between them.


- Examples: Education level (high school, undergraduate, graduate), customer satisfaction (poor, fair, good, excellent).


**2. Quantitative Data**


- Definition: Represents numerical values that quantify a characteristic and can be measured or counted.


- Key Features:


- Numeric in nature.

- Further divided into discrete and continuous data.


- Examples:

- Interval Data: Numerical data with meaningful intervals but no true zero point.


- Examples: Temperature in Celsius or Fahrenheit, years in a calendar.

- Ratio Data: Numerical data with a true zero point, allowing for meaningful comparisons of ratios.

- Examples: Height, weight, age, income.

**Scales of Measurement**


**1. Nominal Scale**


- Definition: Categorizes data into distinct groups or classes without a specific order.


- Characteristics:

- No quantitative value or ranking.


- Cannot perform mathematical operations.

- Examples:

- Blood groups (A, B, AB, O).

- Eye color (blue, green, brown).

**2. Ordinal Scale**


- Definition: Arranges data in a specific order, but the intervals between data points are not necessarily equal.


- Characteristics:

- Reflects relative ranking or position.

- Does not provide precise differences between ranks.

- Examples:

- Movie ratings (1 star to 5 stars).

- Class ranks (1st, 2nd, 3rd).

**3. Interval Scale**


- Definition: Numerical data where intervals between values are consistent, but there is no true zero.


- Characteristics:


- Allows addition and subtraction but not meaningful ratios.


- Zero is arbitrary (does not imply absence).


- Examples:

- IQ scores.

- Temperature in Celsius or Fahrenheit.

**4. Ratio Scale**


- Definition: Numerical data with consistent intervals and a true zero point, making ratios meaningful.

- Characteristics:

- Permits all arithmetic operations (addition, subtraction, multiplication, division).

- True zero indicates the absence of the measured property.

- Examples:

- Distance traveled (0 km means no travel).

- Weight (0 kg means no weight).

**2. What are the measures of central tendency, and when should you use each? Discuss the mean, median, and mode with examples and situations where each is appropriate.**

**Measures of Central Tendency**


Central tendency measures describe the center or typical value of a data set. The three primary measures are mean, median, and mode. Each has specific use cases depending on the nature of the data and the context of analysis.

**1. Mean (Arithmetic Average)**


- The mean is calculated by summing all the data values and dividing by the number of values.

**Formula:**

- Mean = Sum of all values / Number of values

- Example:
For the data set {4, 8, 15, 16, 23, 42}:

 Mean = (4 + 8 + 15 + 16 + 23 + 42) / 6 = 108 / 6 = 18


 **When to Use:**


- Data is continuous or interval/ratio-scaled.


- Distribution is symmetric with no significant outliers.


**Avoid Mean:**

- When the data contains extreme outliers (e.g., income data).

- In skewed distributions where it may not represent the center accurately.
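As a minimal sketch of the calculation above, the following Python snippet computes the mean of the same data set both by hand and with the standard-library `statistics.mean`. The extra value `1000` is a hypothetical outlier added only to illustrate how strongly a single extreme value can shift the mean.

```python
from statistics import mean

data = [4, 8, 15, 16, 23, 42]

# Mean = sum of all values / number of values
manual_mean = sum(data) / len(data)
library_mean = mean(data)
print(manual_mean, library_mean)

# Hypothetical outlier: one extreme value pulls the mean far from the bulk of the data.
data_with_outlier = data + [1000]
mean_with_outlier = mean(data_with_outlier)
print(mean_with_outlier)
```

The second result lies well above every value except the outlier itself, which is why the mean is usually avoided for data such as incomes.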


**2. Median**


The median is the middle value in an ordered data set. If the number of values is even, the median is the average of the two middle numbers.

**Example:**


- For the data set {4, 8, 15, 16, 23, 42} (already ordered):

- Number of values = 6 (even)

- Median = (15 + 16) / 2 = 15.5



**When to Use:**


- Data is ordinal or interval/ratio-scaled.


- When the data contains outliers or is skewed (e.g., house prices).


**Avoid Median:**


- When the exact average is important.


- For small sample sizes, where it may not be stable.
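The sketch below shows how the median handles even and odd counts, and how it compares with the mean when an outlier is present. The house prices are hypothetical values (in thousands) chosen only for illustration.

```python
from statistics import mean, median

even_count = [4, 8, 15, 16, 23, 42]
odd_count = [4, 8, 15, 16, 23]

# Even number of values: average of the two middle values (15 and 16).
median_even = median(even_count)
# Odd number of values: the single middle value.
median_odd = median(odd_count)
print(median_even, median_odd)

# Hypothetical house prices with one very expensive property.
prices = [180, 200, 210, 220, 250, 2500]
price_mean = mean(prices)
price_median = median(prices)
print(price_mean, price_median)
```

Here the median stays close to the typical price while the mean is pulled toward the expensive property.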


**3. Mode**


The mode is the value that appears most frequently in the data set. A data set can have:

- No mode (if all values occur only once),


- One mode (unimodal), or


- Multiple modes (bimodal or multimodal).


**Example:**


For the data set {1, 2, 2, 3, 3, 3, 3, 4, 5}:

- Mode = 3 (occurs most frequently).

**When to Use:**


- Data is nominal (categorical) and you're identifying the most common category.

- Example: Finding the most popular color in a survey (e.g., red, blue, green).


- In bimodal or multimodal distributions, to highlight frequent peaks.

**Avoid Mode:**

- When the data has no repeated values.


- For numerical data with no clear frequent value.
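A short sketch of finding the mode in Python, using only the standard library. The colour survey and the bimodal list are hypothetical examples; `statistics.multimode` returns every value tied for the highest frequency, which also covers the "no repeated values" case.

```python
from collections import Counter
from statistics import mode, multimode

values = [1, 2, 2, 3, 3, 3, 3, 4, 5]
single_mode = mode(values)
print(single_mode)

# Nominal data (hypothetical survey): the mode is the most common category.
colors = ["red", "blue", "green", "blue", "red", "blue"]
most_common_color, count = Counter(colors).most_common(1)[0]
print(most_common_color, count)

# Bimodal data: two values share the highest frequency.
bimodal_modes = multimode([1, 1, 2, 3, 3])
# No repeated values: every value is tied, so the mode is not informative.
no_repeat_modes = multimode([7, 8, 9])
print(bimodal_modes, no_repeat_modes)
```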






**3. Explain the concept of dispersion. How do variance and standard deviation measure the spread of data?**

**Concept of Dispersion**


Dispersion refers to the extent to which data values in a dataset spread out or deviate from the central value (e.g., mean or median). It measures the variability or diversity in the data and helps to understand whether the data points are tightly clustered or widely scattered.



**Importance of Dispersion**


- Insight into variability: It shows the consistency or inconsistency of data.


- Comparison: Helps compare the spread between different datasets.


- Decision-making: Useful in risk analysis and quality control.


**Measures of Dispersion**


- The two most common measures of dispersion are variance and standard deviation, which quantify how far individual data points are from the mean.



**Variance**

- Variance measures the average squared deviation of each data point from the mean. It gives a sense of the overall spread of the data.


**Standard Deviation**


- The standard deviation (SD) is the square root of the variance, representing the spread in the same units as the original data.


**How They Measure the Spread of Data**


**Variance:**


- Emphasizes larger deviations by squaring them.


- Shows the overall level of variability in the dataset.


**Standard Deviation:**



- Offers a more intuitive measure of spread in the same unit as the data.


- Helps in practical applications like identifying typical deviations from the mean.
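The following sketch computes variance and standard deviation step by step on a small hypothetical data set. It assumes the population form (dividing by n); the sample form, which divides by n - 1, is shown at the end for comparison.

```python
from math import sqrt
from statistics import pstdev, pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data set

mu = sum(data) / len(data)
squared_deviations = [(x - mu) ** 2 for x in data]

# Population variance: average squared deviation from the mean.
pop_variance = sum(squared_deviations) / len(data)
# Standard deviation: square root of the variance, in the data's own units.
pop_sd = sqrt(pop_variance)
print(mu, pop_variance, pop_sd)

# The standard library gives the same population values.
lib_variance = pvariance(data)
lib_sd = pstdev(data)

# Sample variance divides by n - 1 instead of n.
sample_variance = variance(data)
print(lib_variance, lib_sd, sample_variance)
```

Squaring makes the value 9 (deviation 4) contribute 16 to the sum, far more than the values near the mean, which is the sense in which variance emphasizes large deviations.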


 **4. What is a box plot, and what can it tell you about the distribution of data?**

**Box Plot (Box-and-Whisker Plot)**



A box plot is a graphical representation of a dataset’s distribution. It summarizes the data using five key statistical measures, providing a visual summary of its central tendency, spread, and potential outliers.

**Key Components of a Box Plot**


1.Median (Q2): The line inside the box represents the median, which divides the dataset into two equal halves.


2.Interquartile Range (IQR):


- The box spans from the first quartile (Q1) (25th percentile) to the third quartile (Q3) (75th percentile).

- IQR = Q3 - Q1, representing the middle 50% of the data.

3.Whiskers:


- Extend from the box to the smallest and largest values within 1.5 * IQR from Q1 and Q3.


- These indicate the range of most data values, excluding outliers.


4. Outliers:


- Data points below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are plotted individually as dots or symbols.


5.Minimum and Maximum:


- The smallest and largest values within the whiskers.
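The components above can be computed directly. This is a minimal sketch on hypothetical test scores; it assumes the "inclusive" quartile method of `statistics.quantiles`, and other quartile conventions can give slightly different Q1 and Q3 values.

```python
from statistics import quantiles

scores = [30, 50, 65, 70, 75, 80, 85, 90, 100]  # hypothetical test scores

# Quartiles (assumed convention: method="inclusive").
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme values that are still inside the fences.
inside = [x for x in scores if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Anything beyond the fences is drawn as an individual outlier point.
outliers = [x for x in scores if x < lower_fence or x > upper_fence]

print(q1, q2, q3, iqr)
print(whisker_low, whisker_high, outliers)
```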


**What a Box Plot Reveals**



1.Center of Data:

- The median shows the central value of the dataset.


- You can compare medians across groups to understand differences in central tendency.


2.Spread (Variability):


- The length of the box (IQR) shows the spread of the middle 50% of data.


- Longer whiskers indicate more variability outside the IQR.


3.Symmetry and Skewness:

- If the box is symmetric around the median, the data is evenly distributed.


- If one whisker or side of the box is longer, the data is skewed.


- Longer right whisker → Right-skewed (positive skew).


- Longer left whisker → Left-skewed (negative skew).


4.Outliers:

- Outliers are identified as individual points outside the whiskers.

- They indicate unusual or extreme values in the dataset.


5.Comparison:

- Multiple box plots can compare distributions across different groups or categories.


**Example of Interpretation**


For a box plot of students' test scores:

- Median: The median score is 75, indicating that half the students scored below 75 and half scored above.


- IQR: The IQR is 20 (Q3 = 85, Q1 = 65), showing the middle 50% of students scored between 65 and 85.


- Whiskers: Scores range from 50 to 100, excluding outliers.


- Outliers: A few students scored below 50, indicating they struggled significantly compared to their peers.


- Skewness: If the whisker on the higher end is longer, it suggests a right-skewed distribution, possibly indicating a few very high scores.



**5. Discuss the role of random sampling in making inferences about populations.**

**Role of Random Sampling in Making Inferences About Populations**



Random sampling is a fundamental method in statistics that involves selecting a subset of individuals or elements from a larger population in such a way that every member of the population has an equal chance of being included. It plays a critical role in making reliable inferences about populations.

**Key Roles of Random Sampling**


1.Representativeness:

- Random sampling tends to produce a sample that reflects the characteristics of the entire population, minimizing systematic bias.


- This representativeness allows statisticians to generalize findings from the sample to the broader population with confidence.


2.Reduction of Bias:

- Since each member of the population has an equal chance of selection, the process reduces selection bias.


- It ensures that the sample is not skewed toward a specific subgroup.


3.Facilitates Statistical Inference:

- Random samples allow the use of probability theory to estimate population parameters (e.g., mean, proportion) and calculate margins of error.


- Statistical tests and confidence intervals are based on the assumption of random sampling.


4. Validates Hypothesis Testing:

- Random samples provide the foundation for conducting hypothesis tests, helping researchers determine whether observed differences or effects are due to chance or a real phenomenon.


5.Generalizability:

- Findings derived from a random sample are more likely to be generalizable to the population than those from a non-random sample.


- This generalizability is critical in fields like medicine, economics, and social sciences.


6.Estimation of Population Parameters:


- Random sampling enables accurate estimation of population parameters such as the mean, variance, and proportion.


- For example, a sample mean is an unbiased estimator of the population mean.


**Example**


Suppose a school wants to estimate the average height of its students. Instead of measuring every student, they randomly select 100 students. If the sampling is random:

- The sample mean height can be used to estimate the population mean height.


- Confidence intervals can quantify the uncertainty in the estimate.


- Any patterns or relationships observed in the sample (e.g., height by grade level) can likely be generalized to the whole school population.
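The school example can be simulated as a rough sketch. The population below is synthetic (1,000 heights drawn from an assumed normal distribution with mean 150 cm and standard deviation 10 cm), and the confidence interval uses the usual normal approximation; both are assumptions made only for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(42)  # fixed seed so the sketch is reproducible

# Synthetic population: heights (cm) of 1,000 students (assumed distribution).
population = [random.gauss(150, 10) for _ in range(1000)]

# Simple random sample of 100 students, drawn without replacement.
sample = random.sample(population, k=100)

population_mean = mean(population)
sample_mean = mean(sample)

# Approximate 95% confidence interval for the population mean.
standard_error = stdev(sample) / sqrt(len(sample))
ci_low = sample_mean - 1.96 * standard_error
ci_high = sample_mean + 1.96 * standard_error

print(round(population_mean, 2), round(sample_mean, 2))
print(round(ci_low, 2), round(ci_high, 2))
```

In practice the population mean is unknown; it is computed here only to show that the sample mean lands close to it.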



**Challenges and Considerations**


1.Sample Size:

- A larger random sample typically provides more accurate and reliable estimates of population parameters.


- Small random samples may lead to higher variability and less reliable inferences.


2.Practical Constraints:

- It can be difficult or costly to ensure randomness in practice (e.g., due to accessibility, budget, or time constraints).


3.Non-response Bias:

- If certain individuals in the random sample do not respond, the resulting sample may no longer be representative.


4.Independence:

- True random sampling assumes that each selection is independent, which may not always be achievable in real-world scenarios.



**6. Explain the concept of skewness and its types. How does skewness affect the interpretation of data?**

**Concept of Skewness**


Skewness is a statistical measure that describes the asymmetry or lack of symmetry in a data distribution. It indicates whether the data values are distributed evenly around the central value or if they lean toward one side.

- A perfectly symmetric distribution (e.g., normal distribution) has no skewness.


- Skewness provides insight into the direction and extent to which a dataset deviates from symmetry.


**Types of Skewness**


1.Symmetric Distribution:

- Skewness = 0.


- The left and right sides of the distribution mirror each other.


- Example: Heights of adults in a population often follow a symmetric, normal distribution.


2.Positive Skewness (Right-Skewed):


- Skewness > 0.


- The tail on the right side of the distribution is longer or fatter than the left side.


- Most data points are concentrated on the lower end, with some extreme higher values.


- Example: Income distribution, where a few high incomes stretch the tail.


3.Negative Skewness (Left-Skewed):



- Skewness < 0.


- The tail on the left side of the distribution is longer or fatter than the right side.


- Most data points are concentrated on the higher end, with some extreme lower values.


- Example: Scores on an easy test, where most students score high but a few score low.
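As an illustration, the sketch below computes one common skewness statistic, the moment coefficient (third central moment divided by the variance raised to the power 1.5). The text above does not fix a specific formula, so this choice is an assumption, and the data sets are hypothetical.

```python
from statistics import mean, median


def skewness(data):
    """Moment coefficient of skewness: m3 / m2**1.5 (population form)."""
    n = len(data)
    mu = sum(data) / n
    m2 = sum((x - mu) ** 2 for x in data) / n  # second central moment (variance)
    m3 = sum((x - mu) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]   # long tail of high values
left_skewed = [-x for x in right_skewed]    # mirror image: long tail of low values

for name, data in [("symmetric", symmetric),
                   ("right-skewed", right_skewed),
                   ("left-skewed", left_skewed)]:
    print(name, round(skewness(data), 3), mean(data), median(data))
```

For the right-skewed data the mean exceeds the median, and for the mirrored data the ordering reverses.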



**How Skewness Affects Interpretation of Data**


1.Central Tendency:

- Mean, median, and mode typically differ in skewed distributions:


- Right-skewed: Mean > Median > Mode.


- Left-skewed: Mean < Median < Mode.

- The mean is pulled toward the tail, making the median a better measure of central tendency in skewed data.


2.Spread and Variability:

- Skewness can exaggerate the perception of variability, particularly when the tail includes extreme outliers.


3.Outliers:

- Skewed distributions often contain significant outliers in the tail, influencing summary statistics and regression models.


4.Interpretation of Results:



- In a right-skewed income distribution, focusing on the mean might overestimate the typical income.


- For left-skewed test scores, the mean might underestimate typical performance.


5.Choosing Statistical Methods:

- Skewness affects which statistical tests and measures are appropriate.


- Parametric tests often assume normality (symmetry); skewed data may require transformations (e.g., logarithmic) or non-parametric tests.


6.Visualization:



- Skewed data may require special visualizations (e.g., log-scale histograms) to reveal patterns or trends effectively.



**Examples of Real-Life Implications**


1.Healthcare:

- Right-skewed distributions of hospital stays suggest that most patients stay briefly, but a few require extended care.


- Median stay may be more informative than mean.


2.Finance:

- Investment returns are often right-skewed due to rare, large positive gains.


- Skewness must be accounted for in risk assessment.


3.Education:

- Test scores with negative skewness indicate high overall performance, but interventions may be needed for low-performing students.



**7. What is the interquartile range (IQR), and how is it used to detect outliers?**

**What is the Interquartile Range (IQR)?**


The Interquartile Range (IQR) is a measure of statistical dispersion that describes the range within which the middle 50% of data values lie. It is the difference between the third quartile (Q3) and the first quartile (Q1) of a dataset.

Formula:

IQR = Q3-Q1


- Q1 (First Quartile): The 25th percentile of the data; 25% of the data values are below this point.


- Q3 (Third Quartile): The 75th percentile of the data; 75% of the data values are below this point.


**Why is IQR Important?**


- It measures the spread of the central portion of the data, ignoring extreme values.


- It is robust against outliers since it focuses on the middle 50% of the data.

**How is IQR Used to Detect Outliers?**


Outliers are data points that lie significantly outside the range of most of the data. The IQR helps to identify these outliers using the following rules:

Steps to Detect Outliers:


1.Calculate IQR: IQR=Q3-Q1


2.Determine the Lower and Upper Bounds:

- Lower Bound: Q1 - 1.5 × IQR


- Upper Bound:  Q3+1.5 × IQR


3.Identify Outliers:

- Any data point below the lower bound or above the upper bound is considered an outlier.
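The three steps above can be sketched as a small helper function. The function name `iqr_outliers` and the data are hypothetical, and the quartiles again assume the "inclusive" method of `statistics.quantiles`.

```python
from statistics import quantiles


def iqr_outliers(data):
    """Return (lower_bound, upper_bound, outliers) using the 1.5 * IQR rule."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")  # step 1: quartiles
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr                          # step 2: bounds
    upper_bound = q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower_bound or x > upper_bound]  # step 3
    return lower_bound, upper_bound, outliers


values = [10, 12, 13, 14, 15, 16, 18, 19, 95]  # hypothetical data
lower_bound, upper_bound, outliers = iqr_outliers(values)
print(lower_bound, upper_bound, outliers)
```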



**Advantages of Using IQR for Outlier Detection**


- Robust to Outliers: Unlike the mean and standard deviation, the IQR is largely unaffected by extreme values.


- Non-parametric: Works well with skewed data and non-normal distributions.


- Simple and Visual: Often visualized in box plots to highlight outliers.


**Applications of IQR**


- Data Cleaning: Identifying and handling outliers in datasets.


- Risk Management: Detecting anomalies in financial data or stock prices.


- Quality Control: Spotting defective items in manufacturing.


By using IQR, analysts can focus on the most meaningful part of the data and address outliers effectively, improving the accuracy and reliability of statistical analyses.



**8. Discuss the conditions under which the binomial distribution is used.**

**Binomial Distribution: Overview**


The binomial distribution is a discrete probability distribution that describes the number of successes in a fixed number of independent trials of a binary (two-outcome) experiment.



Each trial must satisfy certain conditions, and the binomial distribution provides the probabilities of obtaining a specific number of successes in these trials.



**Conditions for Using the Binomial Distribution**


For a scenario to follow a binomial distribution, the following conditions must be met:

1. Fixed Number of Trials (n):

- The experiment consists of a predetermined number of trials.


- Each trial is conducted under identical conditions.



Example: Tossing a coin 10 times.


2.Binary Outcomes:


- Each trial has only two possible outcomes, commonly referred to as success and failure.


- The probability of success is denoted p, and the probability of failure is 1−p.


Example: In a coin toss, outcomes are heads (success) or tails (failure).

3.Constant Probability (p):


- The probability of success (p) remains the same across all trials.


- Similarly, the probability of failure (1−p) does not change.


Example: In a fair coin toss, p=0.5 for heads throughout all trials.



4.Independence of Trials:

- The outcome of one trial does not influence the outcomes of other trials.


- Trials are statistically independent.


 Example: When a coin is tossed multiple times, the result of one toss does not affect the next.


**Examples of Scenarios That Fit Binomial Distribution**


1.Coin Toss:

- Toss a coin 10 times (n=10).


- Success is getting heads (p=0.5).


2.Quality Control:


- Inspect 20 products to check if they meet quality standards.


- Success is a product passing inspection (p=0.95).


3.Medical Trials:

- Test a new drug on 50 patients.


- Success is the patient responding positively (p=0.8).


4.Customer Feedback:

- Survey 100 customers to determine if they are satisfied with a service.


- Success is a satisfied customer (p=0.7).
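For scenarios like these, probabilities follow from the standard binomial probability mass function, P(X = k) = C(n, k) · p^k · (1 − p)^(n − k). The sketch below applies it to the coin-toss and quality-control scenarios; the helper name `binomial_pmf` and the specific questions asked (exactly 5 heads, all 20 products passing) are chosen here for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# Coin toss: exactly 5 heads in 10 tosses of a fair coin.
p_five_heads = binomial_pmf(5, 10, 0.5)

# Quality control: all 20 inspected products pass when p = 0.95.
p_all_pass = binomial_pmf(20, 20, 0.95)

# The probabilities over all possible counts (0..n) sum to 1.
total_probability = sum(binomial_pmf(k, 10, 0.5) for k in range(11))

print(p_five_heads, round(p_all_pass, 4), total_probability)
```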


**When Not to Use the Binomial Distribution**


The binomial distribution is not appropriate if:

1. The trials are not independent (e.g., sampling without replacement from a small population).


2. The number of trials (n) is not fixed.


3. The outcomes are not binary (e.g., having more than two categories).


4. The probability of success (p) changes between trials.





 **9. Explain the properties of the normal distribution and the empirical rule (68-95-99.7 rule).**

**Normal Distribution: Overview**


The normal distribution is a continuous probability distribution that is symmetric about its mean and commonly referred to as the bell curve due to its shape. It is widely used in statistics to model real-world phenomena because many natural and social processes follow a normal distribution.



**Properties of the Normal Distribution**


1.Symmetry:

- The distribution is symmetric around the mean (μ).


- The left and right sides of the curve are mirror images.


2.Mean, Median, and Mode:

- For a normal distribution, the mean, median, and mode are all equal and located at the center of the distribution.


3.Bell-shaped Curve:



- The highest point of the curve occurs at the mean, and the curve tapers off symmetrically as it moves away from the mean.


4.Asymptotic Nature:

- The tails of the curve approach the horizontal axis but never touch it, extending infinitely in both directions.


5.Defined by Two Parameters:

- The mean (μ) determines the center of the distribution.


- The standard deviation (σ) determines the spread or dispersion of the data.


6.Total Area Under the Curve:


- The total area under the curve equals 1, representing the entire probability.

7.Empirical Rule (68-95-99.7 Rule):

- A fixed percentage of data lies within certain ranges around the mean, as described below.


**Empirical Rule (68-95-99.7 Rule)**


The empirical rule provides a quick approximation of the spread of data in a normal distribution. It states that:

1. 68% of Data Within 1 Standard Deviation (μ±σ):


- About 68% of the data values lie within one standard deviation of the mean.


2. 95% of Data Within 2 Standard Deviations (μ±2σ):

- About 95% of the data values lie within two standard deviations of the mean.


3. 99.7% of Data Within 3 Standard Deviations (μ±3σ):


- About 99.7% of the data values lie within three standard deviations of the mean.
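The three percentages can be checked numerically. This sketch uses `statistics.NormalDist` from the standard library with a standard normal distribution (μ = 0, σ = 1); because the rule is stated in units of σ, the same proportions hold for any normal distribution.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def prob_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


within = {k: prob_within(k) for k in (1, 2, 3)}
for k, p in within.items():
    print(f"within {k} sd: {p:.4f}")
```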


**Applications of the Empirical Rule**


1.Understanding Variability:

- Helps quickly estimate the proportion of data within a range.


2.Detecting Outliers:

- Data points lying beyond 3 standard deviations (μ±3σ) are often considered outliers.


3.Standardized Testing:

- Exam scores, such as IQ or SAT, often follow a normal distribution, and the empirical rule provides a framework for interpreting scores.


4.Quality Control:

- Used in manufacturing to monitor deviations in product quality.

**10. Provide a real-life example of a Poisson process and calculate the probability for a specific event.**

**Real-Life Example of a Poisson Process**


A Poisson process models the occurrence of random events over time or space, where:

1. Events occur independently.


2. The average rate of occurrence (λ) is constant.


3. Two events cannot occur simultaneously.


**Example Scenario**


A customer service center receives an average of 5 calls per hour. The arrival of calls is random and independent.

We are interested in finding the probability that the center receives exactly 3 calls in a given hour.



**Poisson Distribution Formula**


The probability of observing k events in a fixed interval is given by:

$$P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$



where:



- X: Number of events (e.g., calls).


- λ: Average rate of occurrence (e.g., 5 calls per hour).


- k: Specific number of events (e.g., 3 calls).


- e: Euler's number (≈2.718).

**Given Data**


- Average rate (λ) = 5 calls/hour.


- Number of events (k) = 3 calls.


Substitute into the formula:

$$P(X = 3) = \frac{5^{3} e^{-5}}{3!}$$


**Step-by-Step Calculation**


1. Compute $5^{3}$: $5^{3} = 125$

2. Compute $e^{-5}$: using $e \approx 2.718$, $e^{-5} \approx 0.00674$

3. Compute $3!$: $3! = 3 \times 2 \times 1 = 6$

4. Combine terms: $P(X = 3) = \dfrac{125 \times 0.00674}{6}$

5. Final result: $P(X = 3) \approx \dfrac{0.8425}{6} \approx 0.1404$
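The same calculation can be reproduced in a few lines of Python as a check on the hand computation. The helper name `poisson_pmf` is hypothetical; it uses `math.exp` rather than the rounded value of e, so the result differs from the manual figure only in later decimal places.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Probability of exactly k events when the average rate is lam."""
    return lam ** k * exp(-lam) / factorial(k)


# Call-centre example: average of 5 calls per hour, exactly 3 calls observed.
p_three_calls = poisson_pmf(3, 5)
print(round(p_three_calls, 4))
```

So there is roughly a 14% chance of receiving exactly 3 calls in a given hour.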
​



**11. Explain what a random variable is and differentiate between discrete and continuous random variables.**

**What is a Random Variable?**



A random variable is a function that assigns a numerical value to each outcome of a random experiment. It provides a way to map outcomes to numbers for easier mathematical analysis. The value a random variable takes depends on the outcome of the random process or experiment.



**Types of Random Variables**



1. Discrete Random Variables:


- Definition: A random variable is discrete if it can take on a countable number of distinct values.


- Key Characteristics:


- Values are usually integers.


- Can be listed explicitly or counted (e.g., 0, 1, 2, ...).


- Examples:


- Number of heads in 5 coin tosses.


- Number of customers entering a shop in an hour.


- Number of defective items in a batch.




Probability Distribution:

- Represented by a probability mass function (PMF), which gives the probability of each possible value.


Example: Let X represent the number of heads in 3 coin tosses. Possible values of X are 0, 1, 2, 3, with corresponding probabilities 1/8, 3/8, 3/8, and 1/8 for a fair coin.



2. Continuous Random Variables:



- Definition: A random variable is continuous if it can take any value within an interval, so its possible values are uncountably infinite.


- Key Characteristics:


- Values are not countable but can be measured.



- Can take any value within a range (e.g., real numbers).


- Examples:


- Heights of people in a population.


- Time taken to complete a task.


- Temperature on a given day.



Probability Distribution:

- Represented by a probability density function (PDF).


- The probability of the random variable taking on an exact value is zero (P(X=x)=0); instead, probabilities are given over intervals.


Example: Let Y represent the time (in minutes) it takes for a train to arrive. Possible values of Y are any real number between, say, 0 and 60.




**Illustrative Example**


1.Discrete Random Variable:

- A company sells between 0 and 10 items daily.


- Random variable X: Number of items sold.


- Possible values: X={0,1,2,…,10}.


- PMF: P(X=x) gives the probability of selling x items.



2.Continuous Random Variable:


- Measure the time (in minutes) customers wait in line.


- Random variable Y: Waiting time.


- Possible values: Y∈[0,∞).


- PDF: f(y) describes the relative likelihood of waiting times near y; probabilities are obtained over intervals (e.g., P(2≤Y≤5)).
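To make the PMF/PDF distinction concrete, the sketch below builds the PMF for the number of heads in 3 fair coin tosses by enumerating outcomes, then computes an interval probability for a continuous variable. The continuous part assumes the train arrival time is uniformly distributed on [0, 60] minutes, which is an assumption made only for this illustration; the helper name `uniform_interval_prob` is hypothetical.

```python
from fractions import Fraction
from itertools import product

# Discrete: X = number of heads in 3 fair coin tosses.
outcomes = list(product("HT", repeat=3))  # 8 equally likely outcomes
pmf = {}
for outcome in outcomes:
    heads = outcome.count("H")
    pmf[heads] = pmf.get(heads, 0) + Fraction(1, len(outcomes))

for value in sorted(pmf):
    print(f"P(X={value}) = {pmf[value]}")


# Continuous: Y = arrival time, ASSUMED uniform on [low, high] minutes.
def uniform_interval_prob(a, b, low=0.0, high=60.0):
    """P(a <= Y <= b): the share of [low, high] covered by [a, b]."""
    a, b = max(a, low), min(b, high)
    return max(b - a, 0.0) / (high - low)


p_2_to_5 = uniform_interval_prob(2, 5)  # probability over an interval
p_exact = uniform_interval_prob(3, 3)   # probability of one exact value
print(p_2_to_5, p_exact)
```

The last line shows the key contrast: an interval has positive probability, while a single exact value has probability zero.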


**12. Provide an example dataset, calculate both covariance and correlation, and interpret the results.**

**Example Dataset**



Let's consider a dataset representing the height (in cm) and weight (in kg) of 5 individuals:



| Individual | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Height (X, cm) | 160 | 165 | 170 | 175 | 180 |
| Weight (Y, kg) | 55 | 60 | 65 | 70 | 75 |


We will calculate the covariance and correlation between height and weight.

**Step 1: Calculate Covariance**



The formula for covariance is:

$$\operatorname{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$$

Where:

- $X_i$ and $Y_i$ are individual values of the variables.

- $\bar{X}$ and $\bar{Y}$ are the means of $X$ and $Y$, respectively.

- $n$ is the number of data points.



**Steps for Covariance Calculation:**



1. Calculate the means:

- Mean of heights ($\bar{X}$): $\bar{X} = \dfrac{160 + 165 + 170 + 175 + 180}{5} = 170$

- Mean of weights ($\bar{Y}$): $\bar{Y} = \dfrac{55 + 60 + 65 + 70 + 75}{5} = 65$

2. Calculate deviations from the mean and their products:

- $(X_1 - \bar{X})(Y_1 - \bar{Y}) = (160 - 170)(55 - 65) = (-10)(-10) = 100$

- $(X_2 - \bar{X})(Y_2 - \bar{Y}) = (165 - 170)(60 - 65) = (-5)(-5) = 25$

- $(X_3 - \bar{X})(Y_3 - \bar{Y}) = (170 - 170)(65 - 65) = (0)(0) = 0$

- $(X_4 - \bar{X})(Y_4 - \bar{Y}) = (175 - 170)(70 - 65) = (5)(5) = 25$

- $(X_5 - \bar{X})(Y_5 - \bar{Y}) = (180 - 170)(75 - 65) = (10)(10) = 100$

3. Sum the products:

$$100 + 25 + 0 + 25 + 100 = 250$$

4. Divide by the number of data points ($n = 5$):

$$\operatorname{Cov}(X, Y) = \frac{250}{5} = 50$$




So, the covariance between height and weight is 50.







**Step 2: Calculate Correlation**






The formula for correlation (Pearson’s correlation coefficient) is:

$$r = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

Where:

- $\operatorname{Cov}(X, Y)$ is the covariance.

- $\sigma_X$ is the standard deviation of $X$.

- $\sigma_Y$ is the standard deviation of $Y$.




**Steps for Correlation Calculation:**




1. Calculate the standard deviation of height ($\sigma_X$):

- Variance of heights:

$$\sigma_X^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2$$

$$\sigma_X^2 = \frac{(160 - 170)^2 + (165 - 170)^2 + (170 - 170)^2 + (175 - 170)^2 + (180 - 170)^2}{5} = \frac{100 + 25 + 0 + 25 + 100}{5} = 50$$

$$\sigma_X = \sqrt{50} \approx 7.071$$

2. Calculate the standard deviation of weight ($\sigma_Y$):

- Variance of weights:

$$\sigma_Y^2 = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$

$$\sigma_Y^2 = \frac{(55 - 65)^2 + (60 - 65)^2 + (65 - 65)^2 + (70 - 65)^2 + (75 - 65)^2}{5} = \frac{100 + 25 + 0 + 25 + 100}{5} = 50$$

$$\sigma_Y = \sqrt{50} \approx 7.071$$

3. Calculate the correlation:

$$r = \frac{50}{7.071 \times 7.071} = \frac{50}{50} = 1$$




So, the correlation between height and weight is 1.




**Interpretation of Results**



- Covariance: The covariance between height and weight is 50, which indicates that as height increases, weight also tends to increase. However, covariance is not normalized and depends on the units of measurement.



- Correlation: The correlation of 1 indicates a perfect positive linear relationship between height and weight. This means that for every unit increase in height, weight increases in a consistent and predictable manner. The correlation value ranges from -1 to 1, with:



- 1 indicating a perfect positive linear relationship.


- -1 indicating a perfect negative linear relationship.


- 0 indicating no linear relationship.



In this case, the correlation of 1 reflects an exact linear relationship between height and weight in this dataset: every 5 cm increase in height corresponds to a 5 kg increase in weight. However, in real-world scenarios, such perfect correlations are rare.
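The whole calculation can be verified with a short Python sketch. It follows the same population formulas used above (dividing by n rather than n - 1); the variable names are chosen here for illustration.

```python
from math import sqrt

heights = [160, 165, 170, 175, 180]  # X, in cm
weights = [55, 60, 65, 70, 75]       # Y, in kg
n = len(heights)

mean_x = sum(heights) / n
mean_y = sum(weights) / n

# Covariance: average product of paired deviations from the means.
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(heights, weights)) / n

# Population standard deviations of each variable.
sd_x = sqrt(sum((x - mean_x) ** 2 for x in heights) / n)
sd_y = sqrt(sum((y - mean_y) ** 2 for y in weights) / n)

# Pearson correlation: covariance scaled by both standard deviations.
r = cov_xy / (sd_x * sd_y)

print(mean_x, mean_y, cov_xy)
print(round(sd_x, 3), round(sd_y, 3), round(r, 6))
```

Using the sample versions (dividing by n - 1) would change the covariance and standard deviations, but the correlation would be the same because the factors cancel.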











