In [None]:
"""
Q1. What are the three measures of central tendency?
Ans :
The three measures of central tendency are:
1. Mean: The mean is the average of a set of numbers.
          It is calculated by adding up all the values in the dataset and then dividing the sum by the total number of values.
          The mean can be affected by outliers, as they can significantly skew the average.

2. Median: The median is the middle value in a dataset when the values are arranged in ascending or descending order.
            If there's an odd number of values, the median is the middle one.
            If there's an even number of values, the median is the average of the two middle values.
            The median is less affected by outliers compared to the mean, making it a more robust measure of central tendency in the presence of extreme values.

3. Mode: The mode is the value that appears most frequently in a dataset.
          A dataset can have no mode (when all values are unique) or multiple modes (when more than one value occurs with the highest frequency).
          The mode is useful for categorical or discrete data where you're interested in finding the most common category or value.
"""

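The three measures can be compared side by side on a small dataset. The sketch below uses Python's standard `statistics` module; the `values` list is a made-up sample (not from the exercise data) with one deliberate outlier so the difference between mean and median is visible.

```python
import statistics

# Hypothetical sample chosen for illustration; 100 is a deliberate outlier
values = [2, 3, 3, 5, 7, 100]

mean_value = statistics.mean(values)      # sum of values / number of values
median_value = statistics.median(values)  # average of the two middle values (3 and 5)
modes = statistics.multimode(values)      # every value tied for the highest frequency

print("Mean =", mean_value)
print("Median =", median_value)
print("Modes =", modes)
```

The outlier drags the mean far above most of the data, while the median and mode stay near the bulk of the values.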

In [None]:
"""
Q2. What is the difference between the mean, median, and mode? How are they used to measure the
central tendency of a dataset?
Ans :
1. The mean provides the arithmetic average and is suitable for roughly symmetric data (such as normally distributed data) without extreme outliers.
    Formula: Mean = (Sum of all values) / (Total number of values)

2. The median is the middle value that's less affected by outliers, making it suitable for skewed or non-normally distributed data.
    No specific formula, but involves sorting the data and finding the middle value(s).

3. The mode highlights the most common value or category, and it's useful for categorical or discrete data.
    No specific formula; it's the value(s) with the highest frequency.

"""

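The "sort, then take the middle value(s)" procedure for the median can be sketched as a small function. `manual_median` is a hypothetical helper written only for illustration; in practice `np.median` or `statistics.median` does the same job.

```python
def manual_median(values):
    """Return the middle value of the sorted data.

    For an even number of values, return the average of the two middle values.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                      # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: mean of the two middle values

odd_result = manual_median([7, 1, 5])
even_result = manual_median([7, 1, 5, 3])
print("Median of [7, 1, 5] =", odd_result)
print("Median of [7, 1, 5, 3] =", even_result)
```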

In [9]:
"""
Q3. Measure the three measures of central tendency for the given height data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]
"""
import numpy as np
from scipy import stats

height_data = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

print("Mean =", np.mean(height_data))
print("Median =", np.median(height_data))
print("Mode =", stats.mode(height_data))

Mean = 177.01875
Median = 177.0
Mode = ModeResult(mode=array([177.]), count=array([3]))
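In this height data both 177 and 178 occur three times, so the dataset is bimodal, while the `stats.mode` output above reports only one of them (177). A minimal sketch that lists every mode, assuming the standard-library `statistics.multimode` is an acceptable substitute for `scipy.stats.mode` here:

```python
import statistics

height_data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
               178.9, 176.2, 177, 172.5, 178, 176.5]

# multimode returns all values tied for the highest count
height_modes = statistics.multimode(height_data)
print("Modes =", sorted(height_modes))
```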




In [10]:
"""
Q4. Find the standard deviation for the given data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]
"""
data = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]
print("Standard Deviation = ", np.std(data))


Standard Deviation =  1.7885814036548633
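Note that `np.std` computes the population standard deviation by default (`ddof=0`, dividing by n). If the heights are treated as a sample from a larger population, the sample standard deviation (dividing by n - 1) may be the more appropriate choice. A short sketch comparing the two; which one the question intends is an assumption left to the reader:

```python
import numpy as np

data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
        178.9, 176.2, 177, 172.5, 178, 176.5]

population_std = np.std(data)      # ddof=0 (default): divide by n
sample_std = np.std(data, ddof=1)  # ddof=1: divide by n - 1

print("Population standard deviation =", population_std)
print("Sample standard deviation     =", sample_std)
```

The sample version is always slightly larger, and the gap shrinks as the number of values grows.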


In [11]:
"""
Q5. How are measures of dispersion such as range, variance, and standard deviation used to describe
the spread of a dataset? Provide an example.
Ans :
Measures of dispersion, such as range, variance, and standard deviation, are used to quantify how spread out or scattered the values in a dataset are.
They provide insights into the variability or dispersion of data points around the central tendency (mean, median, etc.).

1. Range:
  The range is the difference between the maximum and minimum values in a dataset.
  It gives a simple idea of the spread of data but can be greatly affected by outliers.
  Formula: Range = Maximum Value - Minimum Value

2. Variance:
  Variance measures the average of the squared differences between each data point and the mean of the dataset.
  It gives a more detailed understanding of how individual values deviate from the mean.
  Formula (population variance): Variance = (Sum of (Data Point - Mean)^2) / (Total Number of Data Points)
  For a sample, divide by (Total Number of Data Points - 1) instead.

3. Standard Deviation:
  The standard deviation is the square root of the variance.
  It provides a measure of how much individual data points deviate from the mean, in the same units as the original data.
  A higher standard deviation indicates greater variability in the dataset.
  Formula: Standard Deviation = Square Root of Variance

"""
import numpy as np

# Sample dataset
data = [12, 15, 18, 22, 25, 29, 31, 32, 35, 38]

# Calculate range
data_range = np.max(data) - np.min(data)

# Calculate variance
data_variance = np.var(data)

# Calculate standard deviation
data_std_dev = np.std(data)

print("Dataset:", data)
print("Range:", data_range)
print("Variance:", data_variance)
print("Standard Deviation:", data_std_dev)



Dataset: [12, 15, 18, 22, 25, 29, 31, 32, 35, 38]
Range: 26
Variance: 69.21
Standard Deviation: 8.319254774317228
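To connect the formulas to the NumPy results above, the same quantities can be worked out step by step in plain Python. This is only a sketch of the population formulas (dividing by n), which is what `np.var` and `np.std` use by default.

```python
import math

# Same sample dataset as in the cell above
data = [12, 15, 18, 22, 25, 29, 31, 32, 35, 38]

n = len(data)
mean = sum(data) / n

# Range: maximum value minus minimum value
data_range = max(data) - min(data)

# Population variance: average squared deviation from the mean
squared_deviations = [(x - mean) ** 2 for x in data]
variance = sum(squared_deviations) / n

# Standard deviation: square root of the variance, in the data's own units
std_dev = math.sqrt(variance)

print("Mean:", mean)
print("Range:", data_range)
print("Variance:", variance)
print("Standard Deviation:", std_dev)
```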


In [None]:
"""
Q6. What is a Venn diagram?
Ans :
A Venn diagram is a graphical representation used to depict the relationships between different sets or groups.
It consists of overlapping circles or other shapes, where each circle represents a set, and the overlapping areas represent the intersection between those sets.
Venn diagrams are often used to visually illustrate the similarities and differences between various elements or categories.

A U B (A Union B):
The union of two sets, A and B, denoted as A U B, represents the collection of all elements that belong to either set A, or set B, or both. In other words, it combines all the unique elements from both sets without duplication.
For example, if A = {1, 2, 3} and B = {3, 4, 5}, then A U B = {1, 2, 3, 4, 5}, as it includes all distinct elements from both sets.

A ∩ B (A Intersection B):
The intersection of two sets, A and B, denoted as A ∩ B, represents the collection of elements that belong to both set A and set B. In other words, it shows the common elements shared between the two sets.
For example, if A = {1, 2, 3} and B = {3, 4, 5}, then A ∩ B = {3}, as it includes only the element that is present in both sets.

"""

In [None]:
"""
Q7. For the two given sets A = (2,3,4,5,6,7) & B = (0,2,6,8,10). Find:
(i) A ∩ B
(ii) A ⋃ B

Ans :
(i) A ∩ B = {2, 6} (Common elements in both sets A and B)
(ii) A ⋃ B = {0, 2, 3, 4, 5, 6, 7, 8, 10} (All unique elements from sets A and B)
"""

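The answer can be checked with Python's built-in `set` type, where `&` is intersection and `|` is union. The variable names below are chosen for illustration.

```python
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

intersection = A & B  # elements present in both sets
union = A | B         # all distinct elements from either set

print("A intersection B =", sorted(intersection))
print("A union B =", sorted(union))
```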

In [None]:
"""
Q8. What do you understand about skewness in data?
Ans :
Skewness is a statistical measure that describes the asymmetry or lack of symmetry in the distribution of data points in a dataset.
In other words, it quantifies the degree to which a dataset's distribution is skewed or tilted to one side.

There are three main types of skewness:
1. Positive Skewness (Right Skewness):
  In a positively skewed distribution, the tail of the distribution extends more towards the right side (higher values) of the data.
  This means that there are a few data points with very high values that pull the mean in that direction, while the majority of the data lies on the left side.

2. Negative Skewness (Left Skewness):
  In a negatively skewed distribution, the tail of the distribution extends more towards the left side (lower values) of the data.
  This means that there are a few data points with very low values that pull the mean in that direction, while the bulk of the data is on the right side.

3. Symmetric Distribution:
  In a symmetric distribution, the data points are evenly distributed on both sides of the mean, and the distribution appears balanced without any skewness.
"""

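The sign of the skewness can be checked numerically. The sketch below assumes the moment coefficient of skewness (third central moment divided by the second central moment raised to 1.5) as the measure; `moment_skewness` and the three small datasets are hypothetical and exist only for illustration.

```python
import numpy as np

def moment_skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5, using population moments."""
    x = np.asarray(values, dtype=float)
    deviations = x - x.mean()
    m2 = np.mean(deviations ** 2)  # second central moment (population variance)
    m3 = np.mean(deviations ** 3)  # third central moment
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]     # long tail towards high values
left_skewed = [-v for v in right_skewed]     # mirror image: long tail towards low values
symmetric = [1, 2, 3, 4, 5]                  # balanced around the mean

skew_right = moment_skewness(right_skewed)
skew_left = moment_skewness(left_skewed)
skew_symmetric = moment_skewness(symmetric)

print("Right-skewed data:", skew_right)
print("Left-skewed data: ", skew_left)
print("Symmetric data:   ", skew_symmetric)
```

A positive result indicates right skew, a negative result left skew, and a value near zero a roughly symmetric distribution.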
In [None]:
"""
Q9. If a dataset is right-skewed, what will be the position of the median with respect to the mean?
Ans :
If a dataset is right-skewed, its distribution is stretched out towards the right, with a long tail on the right-hand side.
In this case, a relatively small number of extremely high values pull the mean to the right, away from the bulk of the data.

The median is less affected by those extreme values, so in a right-skewed distribution it typically lies to the left of (is smaller than) the mean:
Mean > Median > Mode


"""

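The ordering can be seen on a small right-skewed dataset. The `values` list below is a made-up example; the relationship is typical for right-skewed data, not a guarantee for every dataset.

```python
import statistics

# Hypothetical right-skewed sample: most values are small, a few are large
values = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]

mean_value = statistics.mean(values)
median_value = statistics.median(values)
mode_value = statistics.mode(values)

print("Mean =", mean_value)
print("Median =", median_value)
print("Mode =", mode_value)
```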
In [None]:
"""
Q10. Explain the difference between covariance and correlation. How are these measures used in
statistical analysis?
Ans :
Covariance and correlation are both statistical measures used to quantify the relationship between two variables.

1. Covariance: Covariance measures the degree to which two variables change together.
                It indicates whether an increase in one variable corresponds to an increase or decrease in another variable.
                A positive covariance suggests that the variables tend to increase together, while a negative covariance suggests that one variable tends to increase when the other decreases.

2. Correlation: Correlation measures the strength and direction of the linear relationship between two variables.
                It is a standardized measure that falls between -1 and 1.
                A correlation coefficient of
                  +1 indicates a perfect positive linear relationship,
                  -1 indicates a perfect negative linear relationship, and
                  0 indicates no linear relationship.

"""

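Both measures can be computed with NumPy. In the sketch below the paired observations (`hours_studied`, `exam_score`) are hypothetical; `np.cov` returns the sample covariance matrix, and the correlation is obtained by dividing the covariance by the product of the two sample standard deviations.

```python
import numpy as np

# Hypothetical paired observations for illustration
hours_studied = np.array([1, 2, 3, 4, 5], dtype=float)
exam_score = np.array([52, 55, 61, 64, 70], dtype=float)

# np.cov returns a 2x2 matrix; entry [0, 1] is the sample covariance (divides by n - 1)
covariance = np.cov(hours_studied, exam_score)[0, 1]

# Correlation = covariance / (std of x * std of y), using the matching sample std (ddof=1)
correlation = covariance / (np.std(hours_studied, ddof=1) * np.std(exam_score, ddof=1))

print("Covariance =", covariance)
print("Correlation =", correlation)
```

The covariance carries the units of both variables (hours times score points), while the correlation is a unitless value between -1 and 1.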

In [14]:
"""
Q11. What is the formula for calculating the sample mean? Provide an example calculation for a dataset.
Ans : The formula for calculating the sample mean is the sum of all the individual data points in the dataset divided by the total number of data points in the sample.

Consider the dataset: 10, 15, 20, 25, 30.

To calculate the sample mean:
Sum of all elements = 10 + 15 + 20 + 25 + 30 = 100
No of elements = 5
so Mean = Sum of all elements / No of elements = 100 / 5 = 20
"""
data = [10, 15, 20, 25, 30]
print("Mean = " , sum(data)/len(data))

Mean =  20.0


In [15]:
"""
Q12. For normally distributed data, what is the relationship between its measures of central tendency?
Ans :
For a normal distribution, which is also known as a Gaussian distribution or bell curve, there is a specific relationship between its measures of central tendency: mean, median, and mode.
This relationship is a key characteristic of normal distributions and contributes to their symmetrical shape. The relationship is as follows:

1. The mean, median, and mode are all equal and located at the center of the distribution.
2. The distribution is perfectly symmetric around the mean, with half of the data points falling to the left and half to the right of the mean.

"""

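A quick simulation illustrates the point. The sketch below draws samples from a normal distribution with an assumed mean of 50 and standard deviation of 10 (arbitrary values chosen for illustration) and compares the sample mean and median. The mode is not computed because continuous samples rarely repeat exactly.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable
samples = rng.normal(loc=50.0, scale=10.0, size=100_000)  # assumed mean 50, std 10

sample_mean = np.mean(samples)
sample_median = np.median(samples)

print("Sample mean   =", sample_mean)
print("Sample median =", sample_median)
```

With a large sample, the two values should land very close to each other and to the chosen centre of 50.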




In [None]:
"""
Q13. How is covariance different from correlation?
Ans :
Covariance: Covariance measures the degree to which two variables change together.
            It indicates the direction of the linear relationship between two variables.
            A positive covariance suggests that the variables tend to increase together, while a negative covariance suggests that one variable tends to increase as the other decreases.
            Covariance is influenced by the scale of the variables and is not standardized, making it challenging to interpret the strength of the relationship directly from its value.

Correlation: Correlation measures the strength and direction of the linear relationship between two variables,
            while standardizing the measure between -1 and 1, making it easier to interpret and compare.
            A correlation coefficient of +1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship.
            Because correlation is unitless and unaffected by the scale of the variables, it is easier to interpret and compare across datasets than covariance.

"""

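The scale dependence of covariance can be demonstrated by changing units. In the sketch below the height and weight values are hypothetical; converting height from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged.

```python
import numpy as np

# Hypothetical measurements for illustration
height_m = np.array([1.60, 1.65, 1.70, 1.75, 1.80])
weight_kg = np.array([55.0, 60.0, 63.0, 70.0, 74.0])
height_cm = height_m * 100  # same heights, different unit

cov_m = np.cov(height_m, weight_kg)[0, 1]
cov_cm = np.cov(height_cm, weight_kg)[0, 1]

corr_m = np.corrcoef(height_m, weight_kg)[0, 1]
corr_cm = np.corrcoef(height_cm, weight_kg)[0, 1]

print("Covariance (m, kg):   ", cov_m)
print("Covariance (cm, kg):  ", cov_cm)
print("Correlation (m, kg):  ", corr_m)
print("Correlation (cm, kg): ", corr_cm)
```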
In [16]:
"""
Q14. How do outliers affect measures of central tendency and dispersion? Provide an example.
Ans :
Outliers are data points that deviate significantly from the rest of the dataset.
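They can have a notable impact on measures of central tendency (mean, median, mode) and measures of dispersion (range, variance, standard deviation):
the mean is pulled toward the outlier and the range, variance, and standard deviation are inflated, while the median and mode usually change very little.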

"""

import numpy as np

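# Original dataset (note that 120 is already far from the other values)
data = np.array([25, 27, 30, 32, 33, 34, 36, 38, 40, 120])

# Calculate measures of central tendency
mean = np.mean(data)
median = np.median(data)
# Most frequent value via bincount; every value here is unique, so argmax
# simply returns the smallest value rather than a meaningful mode
mode = np.argmax(np.bincount(data))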

# Calculate measures of dispersion
range_val = np.max(data) - np.min(data)
variance = np.var(data)
std_deviation = np.std(data)

print("Original Dataset:")
print("Data:", data)
print("Mean:", mean)
print("Median:", median)
print("Mode:", mode)
print("Range:", range_val)
print("Variance:", variance)
print("Standard Deviation:", std_deviation)

# Add a more extreme outlier
data_with_outlier = np.append(data, 200)

# Calculate measures of central tendency
mean_outlier = np.mean(data_with_outlier)
median_outlier = np.median(data_with_outlier)
# Same bincount approach; the values are still all unique, so this is again the smallest value
mode_outlier = np.argmax(np.bincount(data_with_outlier))

# Calculate measures of dispersion
range_val_outlier = np.max(data_with_outlier) - np.min(data_with_outlier)
variance_outlier = np.var(data_with_outlier)
std_deviation_outlier = np.std(data_with_outlier)

print("\nDataset with Outlier:")
print("Data:", data_with_outlier)
print("Mean:", mean_outlier)
print("Median:", median_outlier)
print("Mode:", mode_outlier)
print("Range:", range_val_outlier)
print("Variance:", variance_outlier)
print("Standard Deviation:", std_deviation_outlier)


Original Dataset:
Data: [ 25  27  30  32  33  34  36  38  40 120]
Mean: 41.5
Median: 33.5
Mode: 25
Range: 95
Variance: 704.05
Standard Deviation: 26.533940529065788

Dataset with Outlier:
Data: [ 25  27  30  32  33  34  36  38  40 120 200]
Mean: 55.90909090909091
Median: 34.0
Mode: 25
Range: 175
Variance: 2716.2644628099174
Standard Deviation: 52.117794109209164
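Because the original dataset above already contains 120, the effect of a single outlier is easier to see by comparing against the data with that value removed. The sketch below treats the nine remaining values as the "clean" data, which is an assumption made here for illustration.

```python
import numpy as np

# Values from the cell above with 120 removed (assumed to be the "clean" data)
clean = np.array([25, 27, 30, 32, 33, 34, 36, 38, 40])
with_outlier = np.append(clean, 120)

for label, values in [("Without 120", clean), ("With 120", with_outlier)]:
    print(label,
          "-> mean:", round(float(np.mean(values)), 2),
          "median:", float(np.median(values)),
          "std:", round(float(np.std(values)), 2))

# How far each measure moves when the outlier is added
mean_shift = np.mean(with_outlier) - np.mean(clean)
median_shift = np.median(with_outlier) - np.median(clean)
print("Shift in mean:", round(float(mean_shift), 2))
print("Shift in median:", float(median_shift))
```

The mean and standard deviation change substantially, while the median moves by only half a unit.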
