Q1. What are the three measures of central tendency?

The three measures of central tendency are:

1. **Mean:** The mean, often referred to as the average, is calculated by adding up all the values in a data set and then dividing the sum by the number of values. It is sensitive to extreme values (outliers) in the data.

2. **Median:** The median is the middle value of a data set when it is ordered from least to greatest. If there is an even number of observations, the median is the average of the two middle values. The median is less sensitive to extreme values compared to the mean.

3. **Mode:** The mode is the value that appears most frequently in a data set. Unlike the mean and median, it is not necessarily unique: a data set may have no mode, one mode (unimodal), or more than one mode (multimodal).

Q2. What is the difference between the mean, median, and mode? How are they used to measure the
central tendency of a dataset?

The mean, median, and mode are measures of central tendency that describe the center or typical value of a dataset. Here are the key differences between them and how they are used:

1. **Mean:**
   - Calculation: The mean is calculated by adding up all the values in a dataset and then dividing the sum by the number of values.
   - Sensitivity to Outliers: The mean is sensitive to extreme values (outliers) because it takes into account the magnitude of each value.
   - Use: It is often used when the data is symmetrically distributed and not highly skewed. However, it may not be the best measure when there are outliers that significantly affect the average.

2. **Median:**
   - Calculation: The median is the middle value of a dataset when it is ordered from least to greatest. If there is an even number of observations, the median is the average of the two middle values.
   - Insensitivity to Outliers: The median is less sensitive to extreme values compared to the mean because it depends only on the order of values, not their magnitudes.
   - Use: It is particularly useful when the dataset has outliers or is skewed, as it provides a more robust measure of central tendency.

3. **Mode:**
   - Calculation: The mode is the value that appears most frequently in a dataset.
   - Multimodal Distributions: A dataset can be unimodal (one mode), bimodal (two modes), or multimodal (more than two modes), or it may have no mode at all.
   - Use: The mode is useful for categorical data and can be informative in describing the most common category or value in a dataset. However, it is often uninformative for continuous variables, where exact values rarely repeat.



Q3. Measure the three measures of central tendency for the given height data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [8]:
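from statistics import mean, median, mode, stdev
arr = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]
print(mode(arr))    # most frequent value (the first one encountered if there is a tie)
print(mean(arr))    # arithmetic average
print(median(arr))  # middle value of the sorted data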

178
177.01875
177.0
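
One caveat worth checking: in this data both 177 and 178 occur three times, so the data is bimodal, and `mode` reports only the first of the tied values it encounters (Python 3.8 and later). A minimal sketch using `statistics.multimode` from the standard library lists every mode; the variable name `heights` is just an illustrative choice.

```python
from statistics import multimode

heights = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
           178.9, 176.2, 177, 172.5, 178, 176.5]

# multimode returns all most-frequent values, in order of first appearance
modes = multimode(heights)
print(modes)  # [178, 177]
```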


Q4. Find the standard deviation for the given data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [9]:
from statistics import stdev
arr1 = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]
stdev(arr1)  # sample standard deviation (divides by n - 1)

1.8472389305844188
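
Note that `stdev` computes the sample standard deviation, which divides by n - 1. As a sketch of the difference, the standard library's `pstdev` treats the data as a whole population and divides by n, so it comes out slightly smaller for the same values. The variable names here are illustrative.

```python
from statistics import pstdev, stdev

heights = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
           178.9, 176.2, 177, 172.5, 178, 176.5]

sample_sd = stdev(heights)       # divides by n - 1
population_sd = pstdev(heights)  # divides by n
print(sample_sd, population_sd)
```

Which one to report depends on whether the 16 heights are a sample from a larger group or the entire group of interest.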

Q5. How are measures of dispersion such as range, variance, and standard deviation used to describe
the spread of a dataset? Provide an example.

Measures of dispersion, such as range, variance, and standard deviation, are used to describe how spread out or dispersed the values in a dataset are. They provide information about the variability or consistency of the data points. Here's how each measure is used:

1. **Range:**
   - **Calculation:** Range is the difference between the maximum and minimum values in a dataset.
   - **Use:** It gives a quick and simple indication of how spread out the data is. However, it can be sensitive to extreme values.

2. **Variance:**
   - **Calculation:** Variance is the average of the squared differences from the mean, so it quantifies the average squared distance of the data points from the mean.
   - **Use:** Variance provides a more detailed measure of the spread, emphasizing the magnitude of deviations. However, because it involves squared differences, its units are not the same as the original data.

3. **Standard Deviation:**
   - **Calculation:** Standard deviation is the square root of the variance. It is in the same units as the original data and provides a more interpretable measure of spread.
   - **Use:** Standard deviation describes the typical distance of a data point from the mean. It is widely used because it is easy to interpret and, unlike the range, it takes every value into account rather than only the two extremes.

**Example:**
Consider two datasets:

- Dataset A: [5, 10, 15, 20, 25]
- Dataset B: [5, 5, 15, 25, 25]

1. **Range:**
   - Range of Dataset A: \(25 - 5 = 20\)
   - Range of Dataset B: \(25 - 5 = 20\)

   Both datasets have the same range, but the distribution of values is different.

2. **Variance and Standard Deviation** (population formulas, dividing by \(n\)):
   - Variance and standard deviation for Dataset A: 
      \[ \text{Variance} = 50, \quad \text{Standard Deviation} \approx 7.07 \]
   - Variance and standard deviation for Dataset B: 
      \[ \text{Variance} = 80, \quad \text{Standard Deviation} \approx 8.94 \]

   Although the ranges are equal, Dataset B has the larger variance and standard deviation because more of its values sit far from the mean of 15. Variance and standard deviation capture this difference in spread, while the range does not.

In summary, measures of dispersion help to quantify the spread of data and provide a more complete picture of the distribution of values in a dataset. The choice of which measure to use depends on the specific characteristics of the data and the goals of the analysis.
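
The three measures can be checked for Dataset A and Dataset B with a short sketch using Python's `statistics` module. It assumes the population formulas (`pvariance` and `pstdev`, dividing by n); the dictionary layout and names are illustrative.

```python
from statistics import pstdev, pvariance

datasets = {
    "A": [5, 10, 15, 20, 25],
    "B": [5, 5, 15, 25, 25],
}

summary = {}
for name, values in datasets.items():
    summary[name] = {
        "range": max(values) - min(values),
        "variance": pvariance(values),        # population variance (divide by n)
        "std_dev": round(pstdev(values), 2),  # square root of the variance
    }
    print(name, summary[name])
```

Both datasets report a range of 20, while the variance is 50 for A and 80 for B, so the range alone does not distinguish them.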

Q6. What is a Venn diagram?

A Venn diagram is a graphical representation that uses overlapping circles to illustrate the relationships between different sets or groups. Each circle typically represents a set, and the overlapping regions represent the elements that belong to more than one set. The size of the circles and the overlap can be adjusted to convey various relationships and proportions.

Key features of a Venn diagram:

**Circles or Ellipses:** Each circle or ellipse in a Venn diagram represents a set. The part of a circle that does not overlap any other circle contains the elements that belong exclusively to that set.

**Overlap:** The overlapping area between circles represents the elements common to those sets (their intersection). In area-proportional diagrams, the extent of the overlap indicates the degree of intersection between the sets.

**Universal set:** A rectangle drawn around the circles usually represents the universal set, the complete set of elements under consideration. The area covered by the circles together is the union of the sets.

![image.png](attachment:image.png)
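
Each region of a two-set Venn diagram corresponds to a set operation. As an illustration with two small made-up sets (hypothetical values, not taken from the figure), Python's built-in `set` type gives the three regions directly:

```python
# Hypothetical example sets
X = {1, 2, 3, 4}
Y = {3, 4, 5, 6}

only_x = X - Y  # part of the X circle outside the overlap
both = X & Y    # overlapping region (intersection)
only_y = Y - X  # part of the Y circle outside the overlap

print(only_x, both, only_y)  # {1, 2} {3, 4} {5, 6}
```

Together the three regions make up the union `X | Y`.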

Q7. For the two given sets A = (2,3,4,5,6,7) & B = (0,2,6,8,10). Find:
(i) A ∩ B
(ii) A ∪ B

In [14]:
A = {2,3,4,5,6,7}
B = {0,2,6,8,10}
print(A.intersection(B))
print(A.union(B))

{2, 6}
{0, 2, 3, 4, 5, 6, 7, 8, 10}


Q8. What do you understand about skewness in data?

Skewness is a statistical measure that describes the asymmetry or lack of symmetry in a distribution of data. In a symmetrical distribution, the right and left sides are mirror images of each other. Skewness measures the extent and direction of skew (departure from horizontal symmetry) in a dataset.

There are three types of skewness:

Negative Skewness (Left Skewness): In a negatively skewed distribution, the left tail is longer or fatter than the right tail. The majority of the data points are concentrated on the right side, and the distribution has a tail that extends to the left.

Zero Skewness: A distribution is considered symmetric (zero skewness) if the left and right sides are approximately equal in length and shape. The mean, median, and mode are typically close to each other in symmetric distributions.

Positive Skewness (Right Skewness): In a positively skewed distribution, the right tail is longer or fatter than the left tail. The majority of the data points are concentrated on the left side, and the distribution has a tail that extends to the right.
    
As a rule of thumb:

- If skewness is less than -1 or greater than 1, the distribution is highly skewed.
- If skewness is between -1 and -0.5 or between 0.5 and 1, the distribution is moderately skewed.
- If skewness is between -0.5 and 0.5, the distribution is approximately symmetric.
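
The standard library has no skewness function, so the sketch below computes one common definition by hand: the Fisher-Pearson moment coefficient \(g_1 = m_3 / m_2^{3/2}\), where \(m_2\) and \(m_3\) are the second and third central moments. The helper name `skewness` and the sample data are illustrative assumptions, and other libraries may apply a bias correction that gives slightly different values.

```python
from statistics import fmean

def skewness(values):
    """Fisher-Pearson moment coefficient of skewness (population form)."""
    n = len(values)
    mu = fmean(values)
    m2 = sum((x - mu) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mu) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 4, 20]        # long tail to the right
left_skewed = [-20, -4, -3, -3, -2, -2, -1]  # mirror image, long tail to the left

print(skewness(symmetric))     # 0.0
print(skewness(right_skewed))  # positive
print(skewness(left_skewed))   # negative
```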

Q9. If a data is right skewed then what will be the position of median with respect to mean?

In a right-skewed distribution the mean is typically greater than the median (mean > median).
The higher values in the right tail pull the mean in that direction, while the median, which depends only on the order of the values, moves much less.
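
A quick check with a small, made-up right-skewed sample (hypothetical salaries in thousands, where one value is far larger than the rest) illustrates the ordering:

```python
from statistics import mean, median

# Hypothetical right-skewed sample: most values are moderate, one is very large
salaries = [30, 32, 35, 38, 40, 45, 200]

avg = mean(salaries)
mid = median(salaries)
print(avg, mid)  # 60 38
```

The single large value lifts the mean to 60, while the median stays at 38.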

Q10. Explain the difference between covariance and correlation. How are these measures used in
statistical analysis?

Covariance:

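Covariance is a measure of how much two random variables vary together. It indicates the direction of the linear relationship between two variables. The sample covariance between two variables \(X\) and \(Y\) is

\[ \text{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} \]

where \(\bar{X}\) and \(\bar{Y}\) are the sample means and \(n\) is the number of data points.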

Correlation:

Correlation is a standardized measure that describes the strength and direction of a linear relationship between two variables. Unlike covariance, correlation is unitless and ranges between -1 and 1. A correlation of 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship. The correlation coefficient \(r\) is the covariance divided by the product of the two standard deviations:

\[ r = \frac{\text{Cov}(X, Y)}{s_X \, s_Y} \]


Differences:

1. **Scale:**
   - Covariance is not standardized, so its value depends on the units of the variables.
   - Correlation is standardized, so its value is independent of the units of the variables.

2. **Interpretation:**
   - Covariance can take any value: positive, negative, or zero. The sign indicates the direction of the relationship, but the magnitude depends on the units and so is hard to interpret as a measure of strength.
   - Correlation has a clear interpretation: \(r = 1\) indicates a perfect positive linear relationship, \(r = -1\) indicates a perfect negative linear relationship, and \(r = 0\) indicates no linear relationship.
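
The unit dependence can be seen in a short sketch. The helper functions below are hand-written illustrations of the sample covariance and Pearson correlation, and the height and weight values are made up; Python 3.10 and later also provide `statistics.covariance` and `statistics.correlation`.

```python
from math import sqrt
from statistics import fmean

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mx, my = fmean(xs), fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

heights_m = [1.60, 1.65, 1.70, 1.75, 1.80]  # hypothetical heights in metres
weights_kg = [55, 60, 62, 70, 74]           # hypothetical weights in kilograms
heights_cm = [h * 100 for h in heights_m]   # same heights in centimetres

cov_m = covariance(heights_m, weights_kg)
cov_cm = covariance(heights_cm, weights_kg)
r_m = correlation(heights_m, weights_kg)
r_cm = correlation(heights_cm, weights_kg)

print(cov_m, cov_cm)  # covariance grows by a factor of 100 with the unit change
print(r_m, r_cm)      # correlation is unchanged
```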

Q11. What is the formula for calculating the sample mean? Provide an example calculation for a
dataset.

The sample mean, denoted by \(\bar{X}\), is the average of a set of observations in a sample. The formula for calculating the sample mean is as follows:

\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \]

where:
- \(\bar{X}\) is the sample mean,
- \(X_i\) is the value of the ith observation,
- \(\sum\) denotes the sum of the values,
- \(n\) is the number of observations in the sample.

Let's go through an example calculation:

Suppose we have a dataset: \([12, 15, 18, 21, 24]\)

1. **Add up all the values:**
   \[ 12 + 15 + 18 + 21 + 24 = 90 \]

2. **Count the number of observations (\(n\)):**
   \[ n = 5 \]

3. **Calculate the sample mean:**
   \[ \bar{X} = \frac{90}{5} = 18 \]

So, the sample mean for this dataset is 18. The sample mean represents the central tendency of the data and is often used as a representative value for the entire sample.
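
The same calculation as a minimal Python sketch, using only built-ins (variable names are illustrative):

```python
data = [12, 15, 18, 21, 24]

n = len(data)                # number of observations
sample_mean = sum(data) / n  # sum of the values divided by n
print(sample_mean)  # 18.0
```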

Q12. For a normal distribution data what is the relationship between its measure of central tendency?

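For a normal distribution, mean = median = mode. The distribution is symmetric and has a single peak, so all three measures coincide at its center.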
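
As a small illustration, the standard library's `statistics.NormalDist` (Python 3.8 and later) exposes all three values for a theoretical normal distribution. The parameters below are arbitrary, chosen only for the example.

```python
from statistics import NormalDist

# Hypothetical normal distribution with mean 170 and standard deviation 5
dist = NormalDist(mu=170, sigma=5)

print(dist.mean, dist.median, dist.mode)  # 170.0 170.0 170.0
```

For a finite sample drawn from a normal distribution, the three values will be close but usually not exactly equal.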

Q13. How is covariance different from correlation?

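This is covered in Q10. In short, covariance is unstandardized and depends on the units of the variables, while correlation is the covariance scaled by both standard deviations, so it is unitless and always lies between -1 and 1.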

Q14. How do outliers affect measures of central tendency and dispersion? Provide an example.

Outliers can significantly influence measures of central tendency and dispersion in a dataset. Central tendency measures, such as the mean, are particularly sensitive to outliers, while measures of dispersion, like the range, variance, and standard deviation, can be greatly influenced by extreme values. Here's how outliers affect these measures:

**1. Mean:**
   - **Effect:** Outliers can heavily skew the mean. Since the mean is the sum of all values divided by the number of values, a single extreme value can have a substantial impact on the calculated mean.
   - **Example:** Consider the dataset: \([10, 12, 15, 18, 22, 100]\). The mean without the outlier is \(15.4\), but with the outlier, it becomes \(29.5\).

**2. Median:**
   - **Effect:** The median is less sensitive to outliers than the mean. It is the middle value when the data is ordered, so extreme values do not affect it as much.
   - **Example:** Using the same dataset, the median without the outlier is \(15\), and with the outlier it is \(16.5\), a much smaller shift than in the mean.

**3. Range:**
   - **Effect:** Outliers can substantially affect the range, as it is the difference between the maximum and minimum values. A single extreme value can widen the range.
   - **Example:** In the dataset \([10, 12, 15, 18, 22, 100]\), the range without the outlier is \(22 - 10 = 12\), but with the outlier it becomes \(100 - 10 = 90\).

**4. Variance and Standard Deviation:**
   - **Effect:** Outliers increase the spread of data and, consequently, the variance and standard deviation. Squaring the differences from the mean (used in variance calculation) amplifies the impact of extreme values.
   - **Example:** In the dataset \([10, 12, 15, 18, 22, 100]\), the sample standard deviation without the outlier is about \(4.77\), but with the outlier it becomes about \(34.80\).

In summary, outliers can distort measures of central tendency, pulling the mean towards extreme values, and can also inflate measures of dispersion, exaggerating the spread of data. It's important to be aware of the presence of outliers and consider their impact when interpreting these statistical measures. Depending on the context and goals of the analysis, it may be necessary to address or handle outliers appropriately.
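
The effect of the outlier on each measure can be verified with a short sketch using the `statistics` module. It uses the sample standard deviation (`stdev`, dividing by n - 1); the helper name `summarize` is an illustrative choice.

```python
from statistics import mean, median, stdev

with_outlier = [10, 12, 15, 18, 22, 100]
without_outlier = [x for x in with_outlier if x != 100]

def summarize(values):
    """Return the four measures discussed above for one dataset."""
    return {
        "mean": round(mean(values), 2),
        "median": median(values),
        "range": max(values) - min(values),
        "stdev": round(stdev(values), 2),  # sample standard deviation
    }

before = summarize(without_outlier)
after = summarize(with_outlier)
print(before)
print(after)
```

The median moves the least, while the mean, range, and standard deviation all change substantially once the value 100 is included.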