**Q1. What are the three measures of central tendency?**

In [1]:
# The three measures of central tendency are the mean, median, and mode. 
# These measures provide a way to describe the central value of a set of data.

# 1. Mean: The mean is calculated by summing up all the values in a dataset and dividing 
#          the sum by the total number of values. It is also referred to as the average.

# For example, let's say we have a dataset of exam scores: {85, 90, 92, 78, 88}.
# To find the mean, we add up all the scores (85 + 90 + 92 + 78 + 88 = 433) and 
# divide by the total number of scores (5). The mean is 433/5 = 86.6.


# 2. Median: The median is the middle value in a dataset when the values are arranged in 
#     ascending or descending order. If there is an odd number of values, the median is the value at 
#     the center. If there is an even number of values, the median is the average of the two middle values.

# For example, let's consider the dataset of exam scores: {85, 90, 92, 78, 88}. 
# When we arrange the scores in ascending order, we get {78, 85, 88, 90, 92}. 
# Since there is an odd number of values, the median is the middle value, which is 88.


# 3. Mode: The mode is the value that appears most frequently in a dataset. 
#     It can be useful for finding the most common value or category. In some cases, 
#     there may be multiple modes (bimodal, trimodal, etc.) or no mode at all (no value appears more than once).


# For example, consider the dataset {85, 90, 92, 78, 88, 90, 92}. In this case, the modes are 90 and
# 92 because both values appear twice, more often than any other value.
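
The worked examples above can be checked with Python's built-in `statistics` module. This is a minimal sketch; the variable names are illustrative, and `statistics.multimode` assumes Python 3.8 or later.

```python
import statistics

scores = [85, 90, 92, 78, 88]
mean_score = statistics.mean(scores)      # sum of values / number of values
median_score = statistics.median(scores)  # middle value after sorting

# multimode returns every value tied for the highest frequency
scores_with_repeats = [85, 90, 92, 78, 88, 90, 92]
modes = statistics.multimode(scores_with_repeats)

print(mean_score, median_score, modes)
```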

**Q2. What is the difference between the mean, median, and mode? How are they used to measure the central tendency of a dataset?**

In [3]:
# The mean, median, and mode are measures of central tendency that provide different information about 
# the center value of a dataset. Here are the key differences and uses of each measure:


# Mean: The mean is the average value of a dataset. It is calculated by summing up all the values and dividing 
#     by the total number of values. The mean is sensitive to extreme values, as it takes into account every value 
#     in the dataset. It is commonly used when the data is numerical and follows a roughly symmetric distribution.
#     The mean can be influenced by outliers, which are extreme values that are significantly different from the 
#     other values in the dataset.


# Median: The median is the middle value in a dataset when the values are arranged in ascending or 
#     descending order. It is largely unaffected by extreme values or outliers, as it depends only on the 
#     values in the middle positions. The median is especially useful when dealing with skewed distributions or datasets with outliers.
#     For example, if you have a dataset of income levels, where a few extremely high incomes are present, 
#     the median would provide a more representative measure of the typical income compared to the mean.


# Mode: The mode is the value or values that appear most frequently in a dataset. It is useful for
#     identifying the most common value or category in a dataset. The mode is often used with categorical 
#     or discrete data, such as survey responses or types of objects. It can also be used with numerical data,
#     but it may not always provide a meaningful measure of central tendency, especially when there are multiple 
#     modes or no mode at all.
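
To make the contrast concrete, here is a small sketch with made-up data: a hypothetical list of incomes with one very high earner, and a hypothetical list of categorical survey responses for which only the mode is defined.

```python
import statistics

# Hypothetical incomes: four typical values and one very high earner
incomes = [30_000, 35_000, 40_000, 45_000, 1_000_000]
mean_income = statistics.mean(incomes)      # pulled upward by the outlier
median_income = statistics.median(incomes)  # stays at the middle value

# Hypothetical categorical responses: mean and median are not meaningful here
responses = ["agree", "neutral", "agree", "disagree", "agree"]
most_common_response = statistics.mode(responses)

print(mean_income, median_income, most_common_response)
```

The mean lands far above what four of the five people earn, while the median stays at a typical income.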

**Q3. Compute the three measures of central tendency for the given height data.**

In [4]:
data = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [11]:
import numpy as np

In [13]:
mean = np.mean(data)
print(mean)

177.01875


In [14]:
median = np.median(data)
print(median)

177.0


In [20]:
import statistics
mode = statistics.mode(data)
print(mode)

178
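
Note that `statistics.mode` returns a single value. In this dataset both 177 and 178 occur three times, and in Python 3.8+ `mode` reports whichever of the tied values it meets first. A small sketch using `statistics.multimode` (assumes Python 3.8 or later) lists every tied value:

```python
import statistics

data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]

# multimode lists every value that shares the highest count
all_modes = statistics.multimode(data)
print(all_modes)
```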


**Q4. Find the standard deviation for the given data:**

In [21]:
data2 = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [24]:
# np.std uses ddof=0 by default, i.e. the population standard deviation
std_deviation = np.std(data2)
print(std_deviation)

1.7885814036548633
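
`np.std` divides by n by default (`ddof=0`), so the value above is the population standard deviation. If the heights are treated as a sample from a larger population, the sample standard deviation (dividing by n - 1, `np.std(data2, ddof=1)` in NumPy) is usually preferred. A sketch of the comparison using only the standard library:

```python
import statistics

data2 = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]

population_std = statistics.pstdev(data2)  # divides by n
sample_std = statistics.stdev(data2)       # divides by n - 1

print(population_std, sample_std)
```

The sample value is always slightly larger than the population value for the same data.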


**Q5. How are measures of dispersion such as range, variance, and standard deviation used to describe
the spread of a dataset? Provide an example.**

In [25]:
# Measures of dispersion, such as range, variance, and standard deviation, provide information about 
# the spread of a dataset. They quantify how much the values deviate from the central 
# tendency measures (such as mean or median). Here's an explanation of each measure and an example of their use:


# Range: The range is the simplest measure of dispersion and represents the difference between the maximum 
#     and minimum values in a dataset. It gives an idea of the spread of the dataset but does not take into 
#     account the distribution of values. For example, consider a dataset of exam scores: {85, 90, 92, 78, 88}.
#         The range is calculated as the difference between the maximum (92) and minimum (78) values, giving a
#         range of 92 - 78 = 14.


# Variance: Variance measures the average squared deviation of each data point from the mean. It provides a
#     more comprehensive measure of dispersion because it uses every value, not just the two extremes. A higher 
#     variance indicates greater variability, while a lower variance suggests more consistency. For example, let's use 
#     the same dataset: {85, 90, 92, 78, 88}. First, we calculate the mean (sum of values divided by the number 
#     of values): (85 + 90 + 92 + 78 + 88) / 5 = 86.6. Next, we calculate the squared deviations of each value from
#     the mean: (85 - 86.6)^2 = 2.56, (90 - 86.6)^2 = 11.56, (92 - 86.6)^2 = 29.16, (78 - 86.6)^2 = 73.96, 
#     (88 - 86.6)^2 = 1.96. Finally, we average these squared deviations: 119.2 / 5 = 23.84. This is the 
#     population variance; the sample variance divides by n - 1 instead, giving 119.2 / 4 = 29.8.



# Standard Deviation: The standard deviation is the square root of the variance. It measures the typical amount 
#     of deviation from the mean in the dataset. The standard deviation is widely used because it 
#     is in the same unit as the original data, making it easier to interpret. Using the previous example, 
#     the population standard deviation is sqrt(23.84) ≈ 4.88.
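
The three measures can be computed step by step for the exam scores used above. This is a minimal sketch that follows the population definitions (dividing by n); the variable names are illustrative.

```python
import math

scores = [85, 90, 92, 78, 88]

score_range = max(scores) - min(scores)
mean = sum(scores) / len(scores)
squared_deviations = [(x - mean) ** 2 for x in scores]

# Population variance: average of the squared deviations (divide by n)
variance = sum(squared_deviations) / len(scores)
std_dev = math.sqrt(variance)

print(score_range, round(variance, 2), round(std_dev, 2))
```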

**Q6. What is a Venn diagram?**

In [26]:
# A Venn diagram is a visual representation that uses overlapping circles or other shapes to show the relationships
# between different sets or groups of things. It helps to illustrate the common and unique elements or 
# characteristics among the sets being compared.

# In simple terms, think of a Venn diagram as a way to organize information by showing how different groups or
# categories overlap. Each circle in the diagram represents a specific group or category, and the overlapping 
# regions indicate the shared elements or attributes between those groups.

**Q7. For the two given sets A = (2,3,4,5,6,7) & B = (0,2,6,8,10). Find:
(i) A ∩ B
(ii) A ⋃ B**

In [33]:
A = (2,3,4,5,6,7)
B = (0,2,6,8,10)

# Keep every element of A that also appears in B (the order of A is preserved)
a_intersection_b = [i for i in A if i in B]

In [37]:
a_intersection_b

[2, 6]

In [42]:
a_union_b = set(A).union(set(B))

In [43]:
a_union_b

{0, 2, 3, 4, 5, 6, 7, 8, 10}
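
Python's built-in `set` type also provides operators for these operations. The sketch below computes the intersection and union directly, plus the two "only in one set" regions that a two-circle Venn diagram would show; the variable names are illustrative.

```python
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

intersection = A & B   # elements in both sets (the overlap of the two circles)
union = A | B          # elements in either set
only_in_a = A - B      # region of A outside the overlap
only_in_b = B - A      # region of B outside the overlap

print(sorted(intersection), sorted(union), sorted(only_in_a), sorted(only_in_b))
```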

**Q8. What do you understand about skewness in data?**

In [44]:
# Skewness is a statistical measure that helps us understand the shape of a distribution of data. 
# It tells us whether the data is concentrated more towards one tail of the distribution or whether it is balanced.

# In simple terms, if a dataset is positively skewed, it means that the tail on the right side of the distribution
# is longer or has more extreme values than the left side. This indicates that there are some unusually high values 
# that pull the distribution in that direction.


# On the other hand, if a dataset is negatively skewed, it means that the tail on the left side of the distribution
# is longer or has more extreme values than the right side. This suggests that there are some unusually low values 
# that drag the distribution in that direction.

# In a symmetric distribution, the data is evenly distributed around the mean, and there is no skewness.

# Skewness is important because it provides insights into the underlying characteristics of the data. 
# It helps us understand if there are any outliers or if the data is significantly deviating from a normal
# or symmetrical pattern. Skewness can impact the interpretation of statistical analyses and the choice of 
# appropriate methods for data analysis.
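
Skewness can also be quantified. The sketch below uses one common definition, the third central moment divided by the second central moment raised to the power 1.5 (other definitions apply a small-sample correction). The `skewness` helper and the datasets are hypothetical and written only for illustration.

```python
def skewness(values):
    """Population skewness: third central moment / (second central moment) ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]   # long right tail
left_skewed = [-x for x in right_skewed]   # mirror image: long left tail
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed), skewness(left_skewed), skewness(symmetric))
```

A positive result corresponds to a right-skewed dataset, a negative result to a left-skewed one, and zero to a symmetric one.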

**Q9. If data is right-skewed, what will be the position of the median with respect to the mean?**

In [45]:
# If a dataset is right-skewed, it means that the tail on the right side of the distribution is longer or has more 
# extreme values than the left side. In this case, the distribution is pulled towards the higher values.

# When a dataset is right-skewed, the mean tends to be greater than the median. The reason is that the mean 
# is influenced by extreme values or outliers in the right tail, which pulls the mean towards higher values.
# As a result, the mean is typically larger than the median.

# To visualize this, imagine a right-skewed dataset with a long tail on the right side. The extreme values in the
# tail will have a greater impact on the mean, pulling it towards the right. On the other hand, the median 
# represents the middle value in the dataset, so it is less influenced by the extreme values in the tail. Therefore,
# the median will be relatively smaller than the mean.

# In summary, in a right-skewed distribution, the median will be smaller than the mean.
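
A quick simulation illustrates this. The sketch assumes an exponential distribution as the example of a right-skewed shape; the seed, rate, and sample size are arbitrary choices.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable
# Exponential samples have a long right tail (rate 1.0 is an arbitrary choice)
samples = [rng.expovariate(1.0) for _ in range(10_000)]

sample_mean = statistics.mean(samples)
sample_median = statistics.median(samples)
print(sample_mean, sample_median, sample_mean > sample_median)
```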

**Q10. Explain the difference between covariance and correlation. How are these measures used in
statistical analysis?**

In [46]:
# Covariance and correlation are both measures of the relationship between two variables, but they have some 
# differences in terms of their interpretation and properties.

# Covariance:

# Covariance measures the extent to which two variables change together. It quantifies the directional 
# relationship between variables.

# Covariance can take on any value, positive or negative, depending on the direction of the relationship.

# The magnitude of covariance is not standardized and depends on the units of the variables.

# Covariance is affected by the scale of the variables, making it difficult to compare across different datasets or variables.


# Correlation:

# Correlation measures the strength and direction of the linear relationship between two variables.

# Correlation is always between -1 and 1, where 1 represents a perfect positive linear relationship, -1 represents 
# a perfect negative linear relationship, and 0 indicates no linear relationship.

# Correlation is a standardized measure, meaning it is not affected by the scale of the variables. It allows for 
# comparisons across different datasets and variables.

# Correlation is sensitive to linear relationships but may not capture non-linear relationships between variables.
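
The effect of units can be demonstrated directly. The sketch below defines two small helper functions (hypothetical names, using the sample formulas that divide by n - 1) and applies them to made-up heights and weights, once with heights in centimetres and once in metres.

```python
import math

def covariance(x, y):
    """Sample covariance (divides by n - 1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical heights (cm) and weights (kg)
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 77, 90]
heights_m = [h / 100 for h in heights_cm]

cov_cm = covariance(heights_cm, weights_kg)
cov_m = covariance(heights_m, weights_kg)     # 100 times smaller: the units changed
corr_cm = correlation(heights_cm, weights_kg)
corr_m = correlation(heights_m, weights_kg)   # unchanged: correlation is unit-free

print(cov_cm, cov_m, corr_cm, corr_m)
```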

**Q11. What is the formula for calculating the sample mean? Provide an example calculation for a
dataset.**

In [47]:
# The formula for calculating the sample mean, denoted by x̄ (pronounced "x-bar"), is the sum of all the values 
# in the dataset divided by the total number of values in the dataset. Mathematically, it can be represented as:

# x̄ = (x₁ + x₂ + x₃ + ... + xₙ) / n

# where x₁, x₂, x₃, ..., xₙ are the individual values in the dataset, and n is the total number of 
# values in the dataset.

In [48]:
dataset = [80, 85, 90, 75, 95]

sample_mean = sum(dataset) / len(dataset)

print("Sample Mean:", sample_mean)


Sample Mean: 85.0


**Q12. For normally distributed data, what is the relationship between its measures of central tendency?**

In [50]:
# In a normal distribution, the measures of central tendency—mean, median, and mode—have a specific relationship:

# Mean: In a normal distribution, the mean is located at the center of the distribution. It coincides with the peak 
#     of the distribution, which is also the point of symmetry. Therefore, the mean is equal to the median in 
#     a normal distribution.

# Median: As mentioned above, the median of a normal distribution is equal to the mean. Since the normal distribution
#     is symmetric, the middle value or the median is the same as the mean value. This means that 50% of the data 
#     falls below the median, and 50% falls above it.

# Mode: In a normal distribution, the mode is also located at the peak of the distribution, which is the same as 
#     the mean and median. Thus, in a normal distribution, the mode is equal to the mean and median.

# In summary, for a normal distribution, the mean, median, and mode are all equal and located at the center of 
# the distribution. This indicates that the data is symmetrically distributed around this central value.
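
This can be checked approximately by simulation. The sketch below draws standard normal samples with a fixed, arbitrary seed. Because continuous values rarely repeat exactly, the mode is estimated from bins of width 0.5, which is an assumption of this example and not a standard definition.

```python
import random
import statistics
from collections import Counter

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [rng.gauss(0.0, 1.0) for _ in range(100_000)]  # standard normal draws

sample_mean = sum(samples) / len(samples)
sample_median = statistics.median(samples)

# Estimate the mode as the centre of the most populated bin of width 0.5
bin_centers = [round(x / 0.5) * 0.5 for x in samples]
modal_bin = Counter(bin_centers).most_common(1)[0][0]

print(sample_mean, sample_median, modal_bin)
```

All three estimates should land close to the true centre of 0, with small differences due to sampling noise.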

**Q13. How is covariance different from correlation?**

In [52]:
# Definition:

# Covariance measures the extent to which two variables change together. It quantifies the directional relationship
# between variables.
# Correlation measures the strength and direction of the linear relationship between two variables. It indicates 
# how closely the data points align to a straight line.

# Scale:

# Covariance is not standardized and can take on any value. The magnitude of covariance is affected by the units 
# of the variables.
# Correlation is always between -1 and 1. It is a standardized measure and is not influenced by the scale of 
# the variables.

# Interpretation:

# Covariance indicates the direction of the relationship through its sign, but its magnitude depends on the 
# units of the variables, so it does not give a clear indication of the strength of the relationship.
# Correlation provides a more straightforward interpretation. A correlation value close to 1 or -1 indicates a 
# strong linear relationship, while a value close to 0 indicates a weak or no linear relationship.

# Comparability:

# Covariance values are not easily comparable across different datasets or variables due to their scale and units.
# Correlation values are standardized and allow for comparisons across different datasets or variables. They
# provide a consistent measure of the strength and direction of the linear relationship.

# Sensitivity to Outliers:

# Covariance is sensitive to outliers in the data and can be significantly affected by extreme values.
# Pearson correlation is also sensitive to outliers: a single extreme point can change its value substantially, 
# although the result always stays between -1 and 1.
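
A quick check of how a single extreme point affects both measures. The helper `cov_and_corr` and the data are hypothetical and written only for this illustration; the formulas are the sample covariance and the Pearson correlation.

```python
import math

def cov_and_corr(x, y):
    """Return (sample covariance, Pearson correlation) for paired data."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (n - 1), sxy / math.sqrt(sxx * syy)

# Hypothetical, perfectly linear data
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 6, 8, 10, 12]
cov_clean, corr_clean = cov_and_corr(x, y)

# Append one extreme point that breaks the pattern
cov_outlier, corr_outlier = cov_and_corr(x + [7], y + [-30])

print(cov_clean, corr_clean)
print(cov_outlier, corr_outlier)
```

With this data, one added point is enough to flip the sign of both the covariance and the Pearson correlation.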

**Q14. How do outliers affect measures of central tendency and dispersion? Provide an example.**

In [54]:
# Outliers can have a significant impact on measures of central tendency and dispersion.
# Let's explore their effects with the help of an example:

# Consider the following dataset of exam scores: 75, 80, 85, 90, 95, 100, 500.

# Measures of Central Tendency:

# Mean: The mean is sensitive to outliers because it takes into account all the values in the dataset. 
#     In this example, the outlier value of 500 is much larger than the other scores. As a result, 
#     the mean is 1025 / 7 ≈ 146.4, higher than every score except the outlier, whereas the mean of the 
#     other six scores is 87.5. The mean is therefore an unreliable representation of the typical 
#     scores in the dataset.

# Median: The median is less affected by outliers since it represents the middle value in the dataset. 
#     In this example, the median is 90 with the outlier and 87.5 without it, so the value of 500 barely 
#     moves it. The median provides a more robust measure of central tendency in the presence of outliers.

# Mode: The mode is the most frequent value in the dataset. In this example, there is no mode as no value repeats.
#     Outliers typically do not impact the mode since it focuses on the frequency of values rather than their 
#     magnitude.

# Measures of Dispersion:

# Range: The range is the difference between the maximum and minimum values in the dataset. 
#     In this example, the outlier value of 500 significantly increases the range compared 
#     to the rest of the scores. Thus, the range becomes larger and does not accurately reflect the
#     spread of the typical scores.

# Variance and Standard Deviation: Both variance and standard deviation are sensitive to outliers.
#     The presence of an outlier, such as the value of 500, can substantially increase the variance 
#     and standard deviation. This is because these measures are based on the squared differences
#     from the mean, and outliers tend to have large squared differences, thereby inflating the measures.
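
The effects described above can be checked by computing each measure with and without the outlier. This is a minimal sketch using the standard library; the standard deviation shown is the population version.

```python
import statistics

scores = [75, 80, 85, 90, 95, 100, 500]
typical = [s for s in scores if s != 500]  # same data without the outlier

for label, values in (("with outlier", scores), ("without outlier", typical)):
    print(
        label,
        round(statistics.mean(values), 2),    # mean
        statistics.median(values),            # median
        max(values) - min(values),            # range
        round(statistics.pstdev(values), 2),  # population standard deviation
    )
```

The mean, range, and standard deviation change sharply when 500 is included, while the median moves only from 87.5 to 90.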