In [1]:
#1
#Mean,Median,Mode

In [2]:
#2
# The mean, median, and mode are all measures of central tendency used to describe the "typical" or central value of a dataset. However, they differ 
#in how they are calculated and what aspects of the data they emphasize.

# 1. Mean: The mean is calculated by adding up all the values in a dataset and dividing the sum by the number of values. It is also known as the 
#average. The mean is sensitive to extreme values, or outliers, as they can heavily influence its value. It is commonly used when the dataset follows
#a symmetrical distribution and there are no significant outliers.

# 2. Median: The median is the middle value in a dataset when the values are arranged in ascending or descending order. If the dataset has an odd 
#number of values, the median is the middle one. If the dataset has an even number of values, the median is the average of the two middle values. 
#The median is less affected by outliers compared to the mean, making it a robust measure of central tendency. It is often used when dealing with 
#skewed distributions or when the presence of outliers can skew the results.

# 3. Mode: The mode is the value that appears most frequently in a dataset. Unlike the mean and median, the mode does not require numeric data. 
#A dataset can have one mode (unimodal) or multiple modes (multimodal) if there are multiple values with the same highest frequency. The mode is 
#useful for identifying the most common or popular value in a dataset, especially in categorical or discrete data.

# These measures can provide different insights into the central tendency of a dataset, and their appropriate usage depends on the nature of the 
#data and the specific context of analysis. It is common to consider all three measures together to gain a more comprehensive understanding of the 
#data distribution.

In [5]:
#3
import numpy as np
height = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]
np.mean(height)

177.01875

In [6]:
np.median(height)

177.0

In [7]:
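# Note: 177 and 178 each occur three times in this data, so it is bimodal; max() returns only one of the tied values.
mode = max(set(height), key=height.count)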
print(mode)

177
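
# Because 177 and 178 are tied for the highest count, it can help to report every mode rather than just one. The following is a minimal
#sketch, assuming Python 3.8 or later for statistics.multimode; the variable name `modes` is introduced here only for illustration.

```python
from statistics import multimode

# Height data from cell 3.
height = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

# multimode returns every value that shares the highest count,
# in the order each value first appears in the data.
modes = multimode(height)
print(modes)
```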


In [8]:
#4
np.std(height)

1.7885814036548633

In [9]:
#5
# Measures of dispersion, such as range, variance, and standard deviation, provide information about the spread or variability of a dataset. 
#They help quantify how much the values in the dataset deviate from the central tendency (mean, median, or mode).

# 1. Range: The range is the simplest measure of dispersion and represents the difference between the maximum and minimum values in a dataset. 
#It gives an idea of the spread of values but does not consider the distribution of values within the dataset. For example, if you have a dataset 
#of exam scores ranging from 60 to 90, the range would be 30 (90 - 60).

# 2. Variance: Variance measures how much the values in a dataset vary from the mean. It is the average of the squared differences between
#each data point and the mean. A higher variance indicates greater variability or spread of data points. For a dataset of N values
#x_1, ..., x_N with mean μ, the population variance is:

#    σ² = Σ (x_i - μ)² / N

#    The sample variance divides by N - 1 instead of N. In both cases the result is expressed in squared units of the data and measures how
#spread out the values are from the mean.

# 3. Standard Deviation: The standard deviation is the square root of the variance. It measures the typical deviation from the mean
#(strictly, the root-mean-square deviation) in the original units of the dataset. Because it shares the data's units, it is easier to
#interpret than the variance. A higher standard deviation implies a greater spread of values.

#    To calculate the standard deviation, take the square root of the variance, where μ is the mean and N is the number of values:

#    σ = sqrt( Σ (x_i - μ)² / N )

#    For example, if a dataset of exam scores has a mean of 75 and a standard deviation of 10, scores typically lie about 10 points from
#the mean.

# These measures of dispersion provide valuable insights into the spread and variability of data points in a dataset, helping to understand the 
#distribution and make comparisons between different datasets.
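
# The three measures above can be sketched with NumPy on the height data from cell 3. This is an illustrative sketch, not the only way to
#compute them; the variable names are introduced here. Note that np.var and np.std divide by N by default, and ddof=1 switches to N - 1.

```python
import numpy as np

# Height data reused from cell 3.
height = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

value_range = max(height) - min(height)   # maximum minus minimum
variance = np.var(height)                 # population variance (divides by N)
std_dev = np.std(height)                  # square root of the population variance
sample_variance = np.var(height, ddof=1)  # sample variance (divides by N - 1)

print(value_range, variance, std_dev, sample_variance)
```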

In [10]:
#6
# A Venn diagram is a graphical representation of the relationships between different sets of objects or data. It consists of overlapping circles or 
#ellipses, where each circle represents a set, and the overlapping regions represent the intersections between sets. Venn diagrams are used to 
#visually illustrate the logical relationships and commonalities between different sets of elements.

# The primary purpose of a Venn diagram is to depict the set relationships through the arrangement of the circles and the overlapping or 
#non-overlapping areas. The diagram can show the following set relationships:

# Intersection: The overlapping region(s) in the diagram represent the elements that are common to all the sets being compared. It shows the 
#intersection or overlap between the sets.

# Union: The entire area enclosed by the circles represents the union of all the sets, including both the elements that are common and those that are
#unique to each set.

# Disjoint sets: If the circles do not overlap, it indicates that the sets have no elements in common. They are completely separate or disjoint.
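
# These set relationships map directly onto Python's built-in set type. The sketch below uses hypothetical sets A, B, and C chosen only for
#illustration; they are not taken from any particular diagram.

```python
# Hypothetical sets chosen only for illustration.
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}
C = {1, 9}

intersection = A & B        # elements that are in both A and B
union = A | B               # elements that are in A, in B, or in both
disjoint = A.isdisjoint(C)  # True when the two sets share no elements

print(sorted(intersection), sorted(union), disjoint)
```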

In [11]:
#7
#1 --> (2,6)
#2 --> (0,2,3,4,5,6,7,8,10)

In [12]:
#8
# Skewness in data refers to the asymmetry or lack of symmetry in the distribution of values. It measures the degree and direction of departure from 
#a symmetric distribution. Skewness provides information about the shape of the distribution and the concentration of values on either side of the 
#central tendency (mean, median, or mode).

# There are three types of skewness:

# 1. Positive Skewness (Right Skewness): In a positively skewed distribution, the tail of the distribution extends towards the higher values, and
#the majority of the data is concentrated on the left side. The mean tends to be larger than the median, and the mode is typically the smallest
#of the three (mode < median < mean). Most values are relatively low, with a few much higher values forming the tail. Visually, the
#distribution appears stretched to the right.

# 2. Negative Skewness (Left Skewness): In a negatively skewed distribution, the tail of the distribution extends towards the lower values, and
#the majority of the data is concentrated on the right side. The mean tends to be smaller than the median, and the mode is typically the largest
#of the three (mean < median < mode). Most values are relatively high, with a few much lower values forming the tail. Visually, the
#distribution appears stretched to the left.

# 3. Zero Skewness: A perfectly symmetrical distribution has zero skewness. The data is evenly distributed around the center, and for a
#symmetric distribution with a single peak the mean, median, and mode coincide.

# Skewness is an important statistical measure as it provides insights into the nature and characteristics of the data distribution. It helps 
#identify departures from normality and assesses the presence of outliers. Skewness is commonly used in fields such as finance, economics, and 
#data analysis to understand the distributional properties of variables and make appropriate inferences.
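
# One common way to quantify skewness is the third central moment divided by the cube of the standard deviation. The sketch below assumes
#that population definition; the function `skewness` and the three small datasets are hypothetical and introduced only for illustration.

```python
import numpy as np

def skewness(values):
    """Population skewness: third central moment over the 1.5 power of the variance."""
    x = np.asarray(values, dtype=float)
    deviations = x - x.mean()
    m2 = np.mean(deviations ** 2)  # second central moment (population variance)
    m3 = np.mean(deviations ** 3)  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]         # long tail towards high values
left_skewed = [-v for v in right_skewed]   # mirror image: long tail towards low values

print(skewness(symmetric), skewness(right_skewed), skewness(left_skewed))
# In the right-skewed data the mean is pulled above the median.
print(np.mean(right_skewed), np.median(right_skewed))
```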

In [13]:
#9
#The median is smaller than the mean.

In [14]:
#10
# Covariance and correlation are both measures used in statistical analysis to quantify the relationship between two variables. However, they differ
#in their interpretation and scale.

# 1. Covariance: Covariance measures the degree to which two variables vary together. It provides information about the direction (positive or 
#negative) and magnitude of the linear relationship between the variables. A positive covariance indicates that as one variable increases, 
#the other tends to increase as well, while a negative covariance indicates an inverse relationship.

#    Covariance values are not standardized and can range from negative infinity to positive infinity. Therefore, it is difficult to compare 
#covariance values directly or draw conclusions about the strength of the relationship between variables.

# 2. Correlation: Correlation measures the strength and direction of the linear relationship between two variables. Unlike covariance, 
#correlation is standardized and falls within a fixed range of -1 to +1. A correlation of +1 indicates a perfect positive linear relationship,
#a correlation of -1 indicates a perfect negative linear relationship, and a correlation of 0 indicates no linear relationship.

#    The Pearson correlation coefficient, denoted by ρ (rho), is the covariance divided by the product of the two variables' standard
#deviations: ρ = cov(X, Y) / (σ_X · σ_Y). A positive coefficient indicates a positive relationship, a negative coefficient indicates a
#negative relationship, and a coefficient close to zero suggests little to no linear relationship.

#    Correlation is useful for understanding the strength and direction of the relationship between variables. It is commonly used in various fields,
#such as economics, social sciences, and finance, to analyze the dependence between variables, identify patterns, make predictions, and assess the 
#significance of relationships in statistical models.
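
# The difference in scale can be seen in a small NumPy sketch. The data (hours studied and exam scores for five students) is hypothetical
#and chosen only for illustration. np.cov divides by N - 1 by default, so the manual check uses ddof=1 for the standard deviations.

```python
import numpy as np

# Hypothetical data: hours studied and exam scores for five students.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = np.array([52.0, 60.0, 61.0, 70.0, 78.0])

cov_xy = np.cov(hours, scores)[0, 1]        # sample covariance (divides by N - 1)
corr_xy = np.corrcoef(hours, scores)[0, 1]  # Pearson correlation coefficient

# Correlation is the covariance rescaled by both standard deviations.
manual_corr = cov_xy / (np.std(hours, ddof=1) * np.std(scores, ddof=1))

# Changing units (hours -> minutes) rescales the covariance but not the correlation.
minutes = hours * 60
cov_minutes = np.cov(minutes, scores)[0, 1]
corr_minutes = np.corrcoef(minutes, scores)[0, 1]

print(cov_xy, corr_xy, cov_minutes, corr_minutes)
```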

In [16]:
#11
# Sample Mean = (Sum of all values in the dataset) / (Number of values in the dataset)

# To calculate the sample mean, you sum up all the values in the dataset and divide the sum by the total number of values.

# Let's take an example dataset of exam scores: 78, 85, 92, 70, 88. 

# To calculate the sample mean, you add up all the values (78 + 85 + 92 + 70 + 88) and then divide the sum by the total number of values (which is 5 in this case):

# Sample Mean = (78 + 85 + 92 + 70 + 88) / 5 = 413 / 5 = 82.6

# Therefore, the sample mean of the given dataset is 82.6.
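
# The same arithmetic can be checked in plain Python. This is a minimal sketch using the exam scores from the example; the variable names
#are introduced here.

```python
scores = [78, 85, 92, 70, 88]

# Sum of all values divided by the number of values.
sample_mean = sum(scores) / len(scores)
print(sample_mean)
```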

In [17]:
#12
#mean = median = mode

In [18]:
#13
#Explained in the answer to question 10.

In [None]:
#14
# Outliers can significantly affect measures of central tendency and dispersion, influencing their values and interpretation. 
#Here's how outliers impact these measures:

# 1. Measures of Central Tendency (e.g., mean, median, mode):
#    - Mean: Outliers can have a substantial impact on the mean, as it takes into account the value of each data point. A single extreme outlier 
#can pull the mean towards its direction, making it no longer representative of the typical value in the dataset.
#    - Median: Outliers have less influence on the median compared to the mean. The median represents the middle value in the dataset, 
#so extreme outliers have a lesser effect as they do not affect the ordering of the other values. The median is often considered a robust 
#measure of central tendency in the presence of outliers.
#    - Mode: Outliers generally do not affect the mode as it represents the most frequently occurring value(s) in the dataset. 
#Outliers that occur infrequently would not impact the mode significantly.

# 2. Measures of Dispersion (e.g., range, variance, standard deviation):
#    - Range: Outliers have a direct impact on the range since it is calculated as the difference between the maximum and minimum values.
#An extreme outlier can widen the range substantially, leading to an overestimation of the spread of the dataset.
#    - Variance and Standard Deviation: Outliers can heavily influence the variance and standard deviation because both involve squaring the
#differences between each data point and the mean. Squaring magnifies large deviations, so a single outlier inflates both measures and
#suggests more variability than the bulk of the data shows.
   
# To illustrate this, let's consider the following dataset of exam scores: 78, 85, 92, 70, 88, and an outlier of 150.

# - Mean: Adding the outlier raises the mean from 82.6 (413 / 5) to about 93.8 (563 / 6), which is above every score except the outlier itself.
# - Median: The median moves only from 85 to 86.5 (the average of 85 and 88), so it stays close to the typical score.
# - Mode: No score repeats in this dataset, so it has no single mode before or after the outlier is added; the outlier does not create one.

# - Range: The range grows from 22 (92 - 70) to 80 (150 - 70).
# - Variance and Standard Deviation: The squared deviation of the outlier from the mean is far larger than that of any other score,
#so both the variance and the standard deviation increase sharply.

# In summary, outliers can distort measures of central tendency and dispersion, particularly the mean, variance, and standard deviation.
#It is crucial to consider outliers carefully and, if necessary, apply outlier detection and removal techniques or use alternative robust
#measures that are less affected by extreme values.
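
# The effect can be checked numerically with the exam scores from the illustration. This is a rough sketch; the helper `summarize` is
#hypothetical and introduced here, and np.std is used with its default population form (divides by N).

```python
import numpy as np

scores = [78, 85, 92, 70, 88]
with_outlier = scores + [150]

def summarize(values):
    """Return the mean, median, range, and population standard deviation."""
    return {
        "mean": float(np.mean(values)),
        "median": float(np.median(values)),
        "range": max(values) - min(values),
        "std": float(np.std(values)),
    }

before = summarize(scores)
after = summarize(with_outlier)

# Print each measure without and with the outlier, side by side.
for name in before:
    print(name, before[name], after[name])
```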