#### Q1. What are the three measures of central tendency?

The three measures of central tendency are:

- Mean: The arithmetic average of a set of numbers. It is calculated by adding up all the values in the set and dividing by the total number of values.
- Median: The middle value in a set of ordered numbers. If the set has an even number of values, the median is the average of the two middle values.
- Mode: The most frequently occurring value in a set of numbers. A set can have one or more modes, or no mode if all values occur with equal frequency.

#### Q2. What is the difference between the mean, median, and mode? How are they used to measure the central tendency of a dataset?

The mean, median, and mode are all measures of central tendency used to describe the central or typical value of a dataset.

The mean is the arithmetic average of a set of numbers, obtained by adding all the values in the dataset and then dividing by the total number of values. It is the most commonly used measure of central tendency and is sensitive to outliers. It is calculated as:

mean = (sum of all values) / (number of values)

The median is the middle value in a dataset when the values are arranged in ascending or descending order. It is less sensitive to outliers than the mean and is useful when the dataset contains extreme values or outliers. It is calculated as:

- If the number of values is odd, the median is the middle value.
- If the number of values is even, the median is the average of the two middle values.

The mode is the most frequently occurring value in a dataset. It is useful for categorical or discrete data, but it can also be used for continuous data. It is not affected by outliers and can be used for skewed data. If there is no value that appears more than once, then the dataset has no mode.

The choice of measure of central tendency depends on the type of data and the research question. If the data is normally distributed and has no outliers, the mean is often used. If the data is skewed or has outliers, the median is a better choice. If the data is categorical or discrete, the mode is used. It is important to report the measure of central tendency used along with any other relevant statistics and to interpret it in the context of the research question.
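As a rough illustration of this guidance, the sketch below uses Python's standard `statistics` module. The salary and colour values are invented for the example and do not come from any real dataset.

```python
import statistics

# Hypothetical right-skewed data: one very large salary pulls the mean upward.
salaries = [30, 32, 35, 38, 40, 250]
salary_mean = statistics.mean(salaries)      # sensitive to the value 250
salary_median = statistics.median(salaries)  # even count: average of 35 and 38

# Hypothetical categorical data: only the mode is meaningful here.
colours = ["red", "blue", "red", "green", "blue", "red"]
colour_mode = statistics.mode(colours)

print("Mean salary:", salary_mean)
print("Median salary:", salary_median)
print("Most common colour:", colour_mode)
```

In this made-up sample the mean is well above the median, so the median is arguably the better summary of a typical salary.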

#### Q3. Calculate the three measures of central tendency for the given height data:
##### [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [1]:
import numpy as np

height_data = np.array([178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5])

# Mean
mean = np.mean(height_data)
print("Mean height:", mean)

# Median
median = np.median(height_data)
print("Median height:", median)

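# Mode: np.unique returns the sorted distinct values and their counts;
# keep every value whose count equals the highest count (the data may be multimodal)
values, counts = np.unique(height_data, return_counts=True)
modes = values[counts == counts.max()]
print("Mode height(s):", modes)

Mean height: 177.01875
Median height: 177.0
Mode height(s): [177. 178.]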


#### Q4. Find the standard deviation for the given data:
#### [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [2]:
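import statistics

data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]

# Population standard deviation: divides the summed squared deviations by n
population_std = statistics.pstdev(data)

# Sample standard deviation: divides by n - 1 (Bessel's correction)
sample_std = statistics.stdev(data)

print("Population standard deviation:", round(population_std, 2))
print("Sample standard deviation:", round(sample_std, 2))

Population standard deviation: 1.79
Sample standard deviation: 1.85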


#### Q5. How are measures of dispersion such as range, variance, and standard deviation used to describe the spread of a dataset? Provide an example.

Measures of dispersion, such as range, variance, and standard deviation, are used to describe how spread out a dataset is around the central tendency, which is typically represented by the mean, median, or mode.

The range is the difference between the largest and smallest values in a dataset and provides a simple measure of the spread. However, it is sensitive to outliers and can be misleading if extreme values are present in the data.

The variance and standard deviation are more sophisticated measures of dispersion that take into account the deviation of each data point from the mean. The variance is the average of the squared differences between each data point and the mean, while the standard deviation is the square root of the variance. A higher variance or standard deviation indicates that the data is more spread out around the mean.
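As an example, the following sketch computes all three measures for a small set of invented temperature readings. The data and variable names are assumptions made for illustration, and the population formulas (dividing by n) are used to match the definition above.

```python
import statistics

# Hypothetical daily temperatures in degrees Celsius, used only for illustration.
temps = [18, 20, 21, 19, 22]

data_range = max(temps) - min(temps)        # largest value minus smallest value
pop_variance = statistics.pvariance(temps)  # mean of the squared deviations from the mean
pop_std = statistics.pstdev(temps)          # square root of the population variance

print("Range:", data_range)
print("Variance:", pop_variance)
print("Standard deviation:", round(pop_std, 3))
```

Here the mean is 20 and the squared deviations are 4, 0, 1, 1, and 4, so the population variance is 10 / 5 = 2 and the standard deviation is the square root of 2.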

#### Q6. What is a Venn diagram?

A Venn diagram is a graphical representation of sets, where each set is represented by a circle or a closed curve, and the circles or curves intersect to show the relationships between the sets. The areas of overlap between the sets show the elements that belong to more than one set, while the areas outside the circles or curves show the elements that do not belong to any of the sets. Venn diagrams are commonly used in mathematics, statistics, logic, and computer science to illustrate set operations, such as union, intersection, and complement. They can also be used to show the relationships between different categories or attributes of data in fields such as marketing, social sciences, and biology.

#### Q7. For the two given sets A = {2,3,4,5,6,7} & B = {0,2,6,8,10}. Find:
#### (i) A ∩ B
#### (ii) A ∪ B

(i) A ∩ B = {2, 6}

(ii) A ∪ B = {0, 2, 3, 4, 5, 6, 7, 8, 10}
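These results can be checked with Python's built-in `set` type. This is a minimal sketch; the variable names `intersection` and `union` are chosen here for readability.

```python
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

intersection = A & B  # elements that belong to both A and B
union = A | B         # elements that belong to A, to B, or to both

print("A ∩ B =", sorted(intersection))  # [2, 6]
print("A ∪ B =", sorted(union))         # [0, 2, 3, 4, 5, 6, 7, 8, 10]
```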

#### Q8. What do you understand about skewness in data?

Skewness is a measure of the degree of asymmetry of a probability distribution around its mean. If a distribution is symmetric, as the normal distribution is, it has zero skewness. If the tail of the distribution is longer on the right-hand side, then it is said to be positively skewed or right-skewed, and the skewness value will be positive. Conversely, if the tail of the distribution is longer on the left-hand side, it is said to be negatively skewed or left-skewed, and the skewness value will be negative. Skewness is commonly computed as the standardized third central moment: the average of the cubed deviations from the mean, divided by the cube of the standard deviation.

Skewness is an important concept in statistics because it affects the interpretation of the central tendency measures. For example, if the data is positively skewed, the mean will typically be greater than the median, and if it is negatively skewed, the mean will typically be less than the median. Therefore, understanding the skewness of data is important for drawing accurate conclusions about the distribution of the data.
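One common way to compute skewness is the population (moment-based) formula: the third central moment divided by the second central moment raised to the power 1.5. The sketch below implements that formula. The helper name `skewness` and the three small datasets are assumptions made for illustration, and libraries may apply additional sample-size corrections.

```python
import statistics

def skewness(values):
    """Population skewness: third central moment / (second central moment ** 1.5)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment (variance)
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical datasets chosen to show each case.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 20]
left_skewed = [-20, -3, -3, -3, -2, -2, -1]

print("Symmetric:", skewness(symmetric))
print("Right-skewed:", skewness(right_skewed))
print("Left-skewed:", skewness(left_skewed))
```

The symmetric data gives a skewness of zero, the data with a long right tail gives a positive value, and the mirrored data gives a negative value.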

#### Q9. If data is right-skewed, what will be the position of the median with respect to the mean?

If data is right-skewed, the median will typically lie to the left of the mean. The mean is pulled toward the long tail on the right side, while the median, which is resistant to extreme values, stays closer to the bulk of the data and is therefore smaller than the mean.

#### Q10. Explain the difference between covariance and correlation. How are these measures used in statistical analysis?

Covariance and correlation are two measures of the relationship between two variables in a dataset.

Covariance is a measure of how two variables vary together. Specifically, it measures how much two variables move in the same direction (positive covariance) or in opposite directions (negative covariance). A positive covariance indicates that the variables tend to increase or decrease together, while a negative covariance indicates that one variable tends to increase when the other decreases, and vice versa. However, covariance does not have a standardized scale, which makes it difficult to compare covariances across different datasets. The formula for covariance is:

cov(X,Y) = E[(X - E[X])(Y - E[Y])]

where X and Y are the two variables, and E[X] and E[Y] are the means of X and Y, respectively.

Correlation, on the other hand, is a standardized measure of the relationship between two variables. Correlation measures the strength and direction of the linear relationship between two variables on a scale from -1 to 1. A correlation of 1 indicates a perfect positive linear relationship, while a correlation of -1 indicates a perfect negative linear relationship. A correlation of 0 indicates no linear relationship between the variables. Correlation is useful because it allows us to compare the strength of the relationship between two variables across different datasets. The formula for correlation is:

corr(X,Y) = cov(X,Y) / (std(X) * std(Y))

where cov(X,Y) is the covariance between X and Y, and std(X) and std(Y) are the standard deviations of X and Y, respectively.

Both covariance and correlation are used in statistical analysis to understand the relationship between two variables. However, correlation is more commonly used because it is standardized and allows for easier comparison between different datasets.
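The two formulas above can be written directly in Python. This is a minimal sketch using the population versions (dividing by n). The function names and the hours/scores data are hypothetical and chosen only for illustration.

```python
import math

def covariance(xs, ys):
    """Population covariance: mean of (x - mean_x) * (y - mean_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Pearson correlation: cov(X, Y) / (std(X) * std(Y))."""
    std_x = math.sqrt(covariance(xs, xs))  # cov(X, X) is the variance of X
    std_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)

# Hypothetical data: hours studied and exam scores.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 68, 74]

print("Covariance:", round(covariance(hours, scores), 2))
print("Correlation:", round(correlation(hours, scores), 4))
```

The covariance is positive, and the correlation is close to 1, which indicates a strong positive linear relationship in this invented sample.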

#### Q11. What is the formula for calculating the sample mean? Provide an example calculation for a dataset.

x̄ = (x1 + x2 + x3 + … + xn) / n, where x1, x2, …, xn are the n observations.

Example:

Suppose we have the following dataset:

[10, 15, 20, 25, 30]

To calculate the sample mean, we first add up all the values:

10 + 15 + 20 + 25 + 30 = 100

Then we divide the sum by the number of values, n = 5:

100 / 5 = 20

So the sample mean is 20.
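The same calculation can be expressed as a short Python function. The name `sample_mean` is chosen here for illustration; it simply applies the formula above.

```python
def sample_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)

dataset = [10, 15, 20, 25, 30]
result = sample_mean(dataset)
print("Sample mean:", result)  # Sample mean: 20.0
```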

#### Q12. For normally distributed data, what is the relationship between its measures of central tendency?

For a normal distribution, the three measures of central tendency (mean, median, and mode) are equal to each other. Because the distribution is symmetric and has a single peak, all three fall at the same central point.
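This can be illustrated with `statistics.NormalDist` from the Python standard library (available since Python 3.8), which exposes `mean`, `median`, and `mode` as properties. The parameters `mu=170` and `sigma=10` are arbitrary values chosen for this sketch.

```python
from statistics import NormalDist

# Hypothetical normal distribution, e.g. heights in cm.
dist = NormalDist(mu=170, sigma=10)

print("Mean:", dist.mean)
print("Median:", dist.median)
print("Mode:", dist.mode)

# Independent checks: half of the probability lies below the mean,
# and the density is the same at equal distances on either side of it.
median_from_cdf = dist.inv_cdf(0.5)
left_density = dist.pdf(160)
right_density = dist.pdf(180)
print("Median from inverse CDF:", median_from_cdf)
print("Density at 160 and 180:", left_density, right_density)
```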

#### Q13. How is covariance different from correlation?

Covariance and correlation are two measures used to describe the relationship between two variables in a dataset.

Covariance indicates the direction of the linear relationship between two variables. It can be positive, negative, or zero. A positive covariance indicates that both variables tend to move in the same direction, while a negative covariance indicates that they tend to move in opposite directions. A covariance of zero means that there is no linear relationship between the variables. Its magnitude depends on the units of the variables, so on its own it does not show how strong the relationship is.

Correlation is the covariance divided by the product of the two standard deviations. It measures both the strength and the direction of the linear relationship on a unit-free scale from -1 to 1. A correlation of 1 means that there is a perfect positive linear relationship between the variables, while a correlation of -1 means that there is a perfect negative linear relationship between the variables. A correlation of 0 means that there is no linear relationship between the variables.
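The practical difference shows up when the units of a variable change. In the sketch below, converting heights from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged. The helper functions and the height/weight values are assumptions made for illustration.

```python
import math

def cov(xs, ys):
    """Population covariance of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def corr(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return cov(xs, ys) / math.sqrt(cov(xs, xs) * cov(ys, ys))

# Hypothetical measurements for four people.
heights_m = [1.60, 1.70, 1.80, 1.90]
weights_kg = [55, 65, 72, 85]
heights_cm = [h * 100 for h in heights_m]  # same heights, different unit

cov_m, cov_cm = cov(heights_m, weights_kg), cov(heights_cm, weights_kg)
corr_m, corr_cm = corr(heights_m, weights_kg), corr(heights_cm, weights_kg)

print("Covariance (m):", cov_m, " Covariance (cm):", cov_cm)
print("Correlation (m):", corr_m, " Correlation (cm):", corr_cm)
```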

#### Q14. How do outliers affect measures of central tendency and dispersion? Provide an example.

Outliers can have a significant impact on measures of central tendency and dispersion. An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.

Among measures of central tendency, the mean is strongly affected by outliers, while the median is much more resistant. For example, consider a dataset of exam scores with values ranging from 60 to 100, with one outlier score of 20. If we calculate the mean of this dataset, the outlier score will bring down the overall average, resulting in a distorted representation of the dataset, whereas the median shifts only slightly.

Measures of dispersion are affected as well. The range depends only on the smallest and largest values, so a single outlier changes it directly. The variance and standard deviation are built from squared deviations from the mean, so an outlier contributes disproportionately and inflates both.

For example, consider a dataset of household incomes ranging from $20,000 to $150,000, with one outlier income of $1,000,000. This outlier will drastically increase the range of the dataset, making it appear much more spread out than it really is. It will also inflate the variance and standard deviation, making it difficult to get an accurate picture of the typical income in the dataset.
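The exam-score scenario described earlier can be sketched in code. The specific scores below are invented to match that description (values from 60 to 100 plus one score of 20), and `summarize` is a hypothetical helper.

```python
import statistics

def summarize(values):
    """Return a few measures of central tendency and dispersion."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "stdev": statistics.pstdev(values),
    }

# Hypothetical exam scores, then the same scores with one outlier added.
scores = [60, 70, 75, 80, 85, 90, 100]
scores_with_outlier = scores + [20]

before = summarize(scores)
after = summarize(scores_with_outlier)

for name in before:
    print(f"{name}: {before[name]:.2f} -> {after[name]:.2f}")
```

In this sample the mean drops from 80 to 72.5 while the median only moves from 80 to 77.5, and the range doubles from 40 to 80.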