### Q1. What are the three measures of central tendency?

A measure of central tendency (also referred to as a measure of centre or central location) is a summary measure that attempts to describe a whole set of data with a single value that represents the middle or centre of its distribution. Measures of central tendency are also classed as summary statistics. The three main measures are the mean, the median and the mode.

1. Mean: The mean is the sum of the value of each observation in a dataset divided by the number of observations. This is also known as the arithmetic average.

   Looking at the retirement age distribution again: 

                                  54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

   The mean is calculated by adding together all the values (54+54+54+55+56+57+57+58+58+60+60 = 623) and dividing by the number of observations (11), which gives approximately 56.6 years.

   The mean has one main disadvantage: it is particularly susceptible to the influence of outliers. These are values that are unusual compared to the rest of the data set by being especially small or large in numerical value.
   
   
2. Median: The median is the middle value in a distribution when the values are arranged in ascending or descending order.

   The median divides the distribution in half (there are 50% of observations on either side of the median value). In a distribution with an odd number of observations, the median value is the middle value.
   The median is less affected by outliers and skewed data than the mean and is usually the preferred measure of central tendency when the distribution is not symmetrical.

   Looking at the retirement age distribution (which has 11 observations), the median is the middle value, which is 57 years:  

                                 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60 

   When the distribution has an even number of observations, the median value is the mean of the two middle values. In the following distribution, the two middle values are 56 and 57, therefore the median equals 56.5 years:

                               52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
                               
                               
3. Mode: The mode is the most commonly occurring value in a distribution. The mode has an advantage over the median and the mean as it can be found for both numerical and categorical (non-numerical) data.

    Consider this dataset showing the retirement age of 11 people, in whole years:

                                54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
                                
    The most commonly occurring value is 54, therefore the mode of this distribution is 54 years.
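
The three calculations above can be checked with a minimal sketch that uses Python's standard `statistics` module. The variable names are only illustrative.

```python
import statistics

# Retirement ages from the example above (11 observations)
ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

mean_age = statistics.mean(ages)      # sum of the values / number of values
median_age = statistics.median(ages)  # middle value of the sorted data
mode_age = statistics.mode(ages)      # most frequently occurring value

print(round(mean_age, 1), median_age, mode_age)  # 56.6 57 54
```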

### Q2. What is the difference between the mean, median, and mode? How are they used to measure the central tendency of a dataset?


#### Mean
The mean (or average) is the most popular and well known measure of central tendency. It can be used with both discrete and continuous data, although its use is most often with continuous data.

The mean is equal to the sum of all the values in the data set divided by the number of values in the data set.
An important property of the mean is that it includes every value in your data set as part of the calculation. In addition, the mean is the only measure of central tendency where the sum of the deviations of each value from the mean is always zero.

The mean has one main disadvantage: it is particularly susceptible to the influence of outliers. These are values that are unusual compared to the rest of the data set by being especially small or large in numerical value.

We usually prefer the median over the mean (or mode) when our data is skewed (i.e., the frequency distribution for our data is skewed). Consider the normal distribution, as this is the one most frequently assessed in statistics: when the data is perfectly normal, the mean, median and mode are identical, and they all represent the most typical value in the data set. As the data becomes skewed, however, the mean loses its ability to provide the best central location because the skewed values drag it away from the typical value. The median retains this position better and is not as strongly influenced by the skewed values.

Example: If there are 5 observations, which are 27, 11, 17, 19, and 21 then the mean is given by : 
                            (27 + 11 + 17 + 19 + 21) ÷ 5 = 19
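
As a rough illustration of the two properties mentioned above, the sketch below checks that the deviations from the mean sum to zero for these five observations, and then shows how one extreme value moves the mean. The extra value 500 is a hypothetical outlier, not part of the original example.

```python
values = [27, 11, 17, 19, 21]

mean = sum(values) / len(values)          # 95 / 5
deviations = [v - mean for v in values]   # distance of each value from the mean

print(mean)             # 19.0
print(sum(deviations))  # 0.0

# Hypothetical outlier added for illustration only
with_outlier = values + [500]
mean_with_outlier = sum(with_outlier) / len(with_outlier)
print(mean_with_outlier)  # about 99.2, far from the typical value of about 19
```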


#### Median

The median is the middle score for a set of data that has been arranged in order of magnitude. 

The median is less affected by outliers and skewed data.

The median is used over the mean for finding the central location when the distribution is not normal and the data is right or left skewed.

Example : If the observations are 25, 36, 31, 23, 22, 26, 38, 28, 20, 32 then the Median is given by :
          Arranging the data in ascending order: 20, 22, 23, 25, 26, 28, 31, 32, 36, 38

          N = 10 which is even then

          Median = Arithmetic mean of values at (10 ÷ 2)th and [(10 ÷ 2) + 1]th position

          ⇒ Median = (Value at 5th position + Value at 6th position) ÷ 2

          ⇒ Median = (26 + 28) ÷ 2

          ⇒ Median = 27
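
The same steps can be sketched in plain Python. The helper name `median` is just an illustrative choice. It handles both the odd and the even case described above.

```python
def median(values):
    """Return the middle value, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of observations: the single middle value
        return ordered[mid]
    # Even number of observations: mean of the (n/2)th and (n/2 + 1)th values
    return (ordered[mid - 1] + ordered[mid]) / 2

observations = [25, 36, 31, 23, 22, 26, 38, 28, 20, 32]
print(median(observations))  # 27.0
```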

#### Mode

The mode is the most frequent score in our data set. On a bar chart or histogram it is represented by the highest bar.

Normally, the mode is used for categorical data where we wish to know which is the most common category.

One of the problems with the mode is that it is not always unique, so it leaves us with problems when two or more values share the highest frequency.

Another problem with the mode is that it will not provide us with a very good measure of central tendency when the most common mark is far away from the rest of the data in the data set.

Example: Find the mode of observations 5, 3, 4, 3, 7, 3, 5, 4, 3.
         Since 3 has occurred a maximum number of times i.e. 4 times in the given data;
         Hence, Mode of the given ungrouped data is 3.
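
A small sketch of counting frequencies with `collections.Counter` follows. The helper `modes` is a hypothetical name. It returns every value that shares the highest count, which makes the non-uniqueness problem visible. The colour list is made-up categorical data for illustration.

```python
from collections import Counter

def modes(values):
    """Return all values that occur with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

print(modes([5, 3, 4, 3, 7, 3, 5, 4, 3]))  # [3]

# Hypothetical categorical data with two modes
print(modes(["red", "blue", "red", "blue", "green"]))  # ['red', 'blue']
```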

### Q3. Calculate the three measures of central tendency for the given height data:

 [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [2]:
import numpy as np
from scipy import stats

In [3]:
height = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [4]:
np.mean(height)

177.01875

In [5]:
np.median(height)

177.0

In [6]:
stats.mode(height)

ModeResult(mode=array([177.]), count=array([3]))
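
Note that in this height data both 177 and 178 occur three times, while the result above reports a single value. A possible way to list every tied value is `statistics.multimode` from the standard library (Python 3.8+). This is a sketch, not a replacement for the SciPy call.

```python
import statistics

height = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
          178.9, 176.2, 177, 172.5, 178, 176.5]

# multimode returns every value that shares the highest count,
# in the order in which each value first appears in the data
print(statistics.multimode(height))  # [178, 177]
```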

### Q4. Find the standard deviation for the given data:

[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [8]:
data = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [10]:
np.std(data)

1.7885814036548633
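
The value above is the population standard deviation, because `np.std` divides by n by default (`ddof=0`). If the heights are treated as a sample from a larger population, the sample standard deviation (dividing by n - 1) may be more appropriate. A minimal sketch:

```python
import numpy as np

data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
        178.9, 176.2, 177, 172.5, 178, 176.5]

population_sd = np.std(data)       # divides by n (ddof=0, the NumPy default)
sample_sd = np.std(data, ddof=1)   # divides by n - 1

print(population_sd)
print(sample_sd)  # slightly larger than the population value
```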

### Q5. How are measures of dispersion such as range, variance, and standard deviation used to describe the spread of a dataset? Provide an example.

Measures of spread describe how similar or varied the set of observed values are for a particular variable (data item). When we discuss measures of spread, we are considering numeric values that are associated with how far our points are from one another. Measures of spread summarise the data in a way that shows how scattered the values are and how much they differ from the mean value. The spread of the values can be measured for quantitative data, as the variables are numeric and can be arranged into a logical order with a low end value and a high end value.

Measures of spread include the range, quartiles and the interquartile range, variance and standard deviation.

One of the most common ways to measure the spread of our data is to calculate the Five Number Summary, which consists of:

* Minimum: The smallest number in the dataset.
* First Quartile (Q1): The value such that 25% of the data falls below it.
* Second Quartile (Q2): The value such that 50% of the data falls below it, i.e., the median.
* Third Quartile (Q3): The value such that 75% of the data falls below it.
* Maximum: The largest value in the dataset.

The 5 Number Summary gives us values for calculating the range and interquartile range.

Consider the following data-set:

5, 8, 3, 2, 1, 3, 10

To calculate the Five Number Summary, the first thing we need to do is order our values, which gives us

1, 2, 3, 3, 5, 8, 10

Once ordered, the minimum and maximum values are easy to identify. As we know, the median is the middle value in our dataset. We also call this Q2 or the second quartile because 50% of the data falls below this value. The remaining two values left to be calculated are Q1 and Q3. These values can be thought of as the medians of the data on either side of Q2. So in this case, as the median is 3, the median of values to the left of Q2 will give us the value of Q1 (2) and the median of values to the right of Q2 will give us the value of Q3 (8).

If the data-set has an even number of values, the value of Q2 (median), will be the mean of the middle 2 values. The value of Q1 will be the median of all values to the left of calculated Q2 and the value of Q3 will be the median of all values to the right of calculated Q2.

Range = Maximum - Minimum = 10 - 1 = 9

Interquartile Range = Q3 - Q1 = 8 - 2 = 6
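
The procedure described above (quartiles as medians of the lower and upper halves) can be sketched as follows. The function name `five_number_summary` is illustrative. Library routines such as `numpy.percentile` interpolate between values by default and may return slightly different quartiles for small datasets.

```python
import statistics

def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # values to the left of Q2
    upper = ordered[half + (n % 2):]  # values to the right of Q2
    return (ordered[0],
            statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper),
            ordered[-1])

minimum, q1, q2, q3, maximum = five_number_summary([5, 8, 3, 2, 1, 3, 10])
print(minimum, q1, q2, q3, maximum)  # 1 2 3 8 10
print(maximum - minimum, q3 - q1)    # 9 6  (range, interquartile range)
```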

##### Standard deviation
The most common way that professionals measure the spread of a data-set with a single value is with the Standard Deviation or Variance. The Standard Deviation tells us, roughly, how far a typical data point lies from the mean of the points.

Example: We wanted to know how far students were located from their school. One student might be 15 km, another 35 km, another only 1 km and another might be living 60 km from the school. We could aggregate all of these distances together to show that the average distance (mean) between students and the school is 27.75 km. The Standard Deviation is how far, on average, these students are located from the mean distance.

To see the calculation on a simpler set of numbers, take the four values 10, 14, 10 and 6, whose mean is (10 + 14 + 10 + 6)/4 = 10. The squared deviation of each value from the mean is:

(10 - 10)² = 0
(14 - 10)² = 16
(10 - 10)² = 0
(6 - 10)² = 16

The average of these values will give us the average squared distance of each observation from the mean, also known as the Variance. Variance = (0 + 16 + 0 + 16)/4 = 32/4 = 8

Standard Deviation = √8 ≈ 2.83
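
The arithmetic above can be reproduced step by step. The sketch assumes the four values 10, 14, 10 and 6 that are implied by the squared deviations, and it computes the population variance (dividing by n), as the worked example does.

```python
import math

values = [10, 14, 10, 6]

mean = sum(values) / len(values)                        # 10.0
squared_deviations = [(v - mean) ** 2 for v in values]  # [0.0, 16.0, 0.0, 16.0]
variance = sum(squared_deviations) / len(values)        # average squared deviation
std_dev = math.sqrt(variance)

print(variance, round(std_dev, 2))  # 8.0 2.83
```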



### Q6. What is a Venn diagram?

A Venn diagram uses overlapping circles or other shapes to illustrate the logical relationships between two or more sets of items. Often, they serve to graphically organize things, highlighting how the items are similar and different.

### Q7. For the two given sets A = {2,3,4,5,6,7} and B = {0,2,6,8,10}, find:

(i) 	A ∩ B

(ii)	A ∪ B

A ∩ B = {2,6}

A ∪ B = {0,2,3,4,5,6,7,8,10}
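
Both results can be verified with Python's built-in `set` type, where `&` is intersection and `|` is union. A minimal sketch:

```python
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

intersection = A & B  # elements present in both sets
union = A | B         # elements present in at least one set

print(sorted(intersection))  # [2, 6]
print(sorted(union))         # [0, 2, 3, 4, 5, 6, 7, 8, 10]
```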

### Q8. What do you understand about skewness in data?

Skewness is a measure of the asymmetry of a distribution, that is, of how far it departs from a symmetrical shape. It shows up when the data points are not distributed symmetrically to the left and right of the centre of the curve: if the bulk of a bell-shaped curve is shifted to the left or the right, the distribution is said to be skewed. Skewness can be quantified as the extent to which a given distribution departs from a symmetrical distribution such as the normal distribution.

The two most common types of skew are:

Negative skew:  A data set with a negative skew has a tail on the negative side of the graph, meaning the graph is skewed to the left.

Positive skew: A data set with a positive skew has a tail on the positive side of the graph, meaning the graph is skewed to the right.

###  Q9. If a data is right skewed then what will be the position of median with respect to mean?

A right-skewed distribution has a long right tail. Right-skewed distributions are also called positive-skew distributions, because the long tail lies in the positive direction on the number line.

If the data is right skewed, the large values in the tail (possibly outliers) pull the mean towards the right, so the mean lies to the right of the peak and, typically, to the right of the median as well. In other words, the median is usually smaller than the mean.

For a typical unimodal right-skewed distribution: Mean > Median > Mode
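
A quick way to see this ordering is to compute the three measures on a small right-skewed dataset. The numbers below are hypothetical and chosen only for illustration (one large value forms the right tail).

```python
import statistics

# Hypothetical right-skewed data: most values are small, one is very large
incomes = [20, 22, 22, 25, 27, 30, 95]

mean = statistics.mean(incomes)      # pulled to the right by 95
median = statistics.median(incomes)  # middle value of the sorted data
mode = statistics.mode(incomes)      # most frequent value

print(round(mean, 2), median, mode)  # 34.43 25 22
```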

### Q10. Explain the difference between covariance and correlation. How are these measures used in statistical analysis?

##### Covariance: 
* Covariance indicates the direction of the linear relationship between two variables. A positive value shows that both variables tend to vary in the same direction, and a negative value shows that they tend to vary in opposite directions.

* The numerical value of covariance is hard to interpret on its own because it depends on the units of the variables, so in practice it is mainly the sign that is read.

##### Correlation: 
* As covariance only tells us about the direction, which is not enough to understand the relationship completely, we divide the covariance by the product of the standard deviations of x and y to get the correlation coefficient, which varies between -1 and +1.

* -1 and +1 indicate that the two variables have a perfect linear relationship.

* A negative value means that as one variable increases the other tends to decrease; the closer the coefficient is to -1, the stronger this linear relationship.

* A positive value means that the variables tend to vary in the same direction; the closer the coefficient is to +1, the stronger this linear relationship.

* If the correlation coefficient is 0, there is no linear relationship between the variables, although another functional relationship could still exist. If two variables are unrelated (independent), the correlation coefficient is 0, but the converse does not hold.
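
The points above can be illustrated with NumPy. The arrays are hypothetical data invented for this sketch. It shows that rescaling a variable changes the covariance but not the correlation, and that a purely non-linear relationship (here v = u²) can have a correlation of 0.

```python
import numpy as np

# Hypothetical paired observations
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

cov_xy = np.cov(x, y)[0, 1]         # sample covariance (divides by n - 1)
corr_xy = np.corrcoef(x, y)[0, 1]   # covariance / (sd of x * sd of y)

# Changing the units of x (e.g. metres to centimetres) rescales the covariance
x_scaled = x * 100
cov_scaled = np.cov(x_scaled, y)[0, 1]
corr_scaled = np.corrcoef(x_scaled, y)[0, 1]

print(cov_xy, cov_scaled)    # covariance grows by a factor of 100
print(corr_xy, corr_scaled)  # correlation is unchanged

# A perfect but non-linear relationship with zero correlation
u = np.array([-2, -1, 0, 1, 2])
v = u ** 2
corr_uv = np.corrcoef(u, v)[0, 1]
print(corr_uv)
```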









### Q11. What is the formula for calculating the sample mean? Provide an example calculation for a dataset.

The sample mean is the average of the values in a sample. It is a measure of central tendency, and it is also needed when calculating the standard deviation and the variance of a data set. The sample mean has a variety of uses, including estimating population averages.

Calculating the sample mean is as simple as adding up the values of the items in a sample set and then dividing that sum by the number of items in the sample set. To calculate the sample mean through spreadsheet software and calculators, you can use the formula:

x̄ = ( Σ xi ) / n

Here, x̄ represents the sample mean, Σ tells us to add, xi refers to all the X-values and n stands for the number of items in the data set.

Example : Find the sample mean for the following set of numbers: 12, 13, 14, 16, 17, 40, 43, 55, 56, 67, 78, 78, 79, 80, 81, 90, 99, 101, 102, 304, 306, 400, 401, 403, 404, 405.

Total numbers in set (n) = 26

sum of all numbers: 12 + 13 + 14 + 16 + 17 + 40 + 43 + 55 + 56 + 67 + 78 + 78 + 79 + 80 + 81 + 90 + 99 + 101 + 102 + 304 + 306 + 400 + 401 + 403 + 404 + 405 = 3744.

x̄ = ( Σ xi ) / n
= 3744/26
= 144

In [12]:
import numpy as np

data = [12, 13, 14, 16, 17, 40, 43, 55, 56, 67, 78, 78, 79, 80, 81, 90, 99, 101, 102, 304, 306, 400, 401, 403, 404, 405]
np.mean(data)

144.0

### Q12. For normally distributed data, what is the relationship between its measures of central tendency?

In a normal distribution, all three measures of central tendency lie at the same point at the centre.

We can use either the mean or the median as our measure of central tendency. In fact, in any symmetrical unimodal distribution the mean, median and mode are equal. However, in this situation, the mean is widely preferred as the best measure of central tendency because it is the measure that includes all the values in the data set for its calculation, and any change in any of the scores will affect the value of the mean. This is not the case with the median or mode.
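
As a rough empirical check, the sketch below draws a large simulated sample from a normal distribution and compares its mean and median. The location 170, scale 10, sample size and seed are arbitrary assumptions made for this illustration. In a finite sample the two values agree only approximately.

```python
import numpy as np

# Hypothetical parameters chosen only for illustration
rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=170, scale=10, size=100_000)

sample_mean = np.mean(sample)
sample_median = np.median(sample)

# Both values should be close to 170 and close to each other
print(sample_mean, sample_median)
```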

### Q13. How is covariance different from correlation?

* Covariance reveals how two variables change together while correlation determines how closely two variables are related to each other.
* Covariance indicates the direction of the linear relationship between variables. Correlation measures both the strength and direction of the linear relationship between two variables.
* Correlation values are standardized. Covariance values are not standardized.
* While correlation coefficients lie between -1 and +1, covariance can take any value between -∞ and +∞.

### Q14. How do outliers affect measures of central tendency and dispersion? Provide an example.

Outliers are data points that are far from other data points. In other words, they’re unusual values in a dataset. Outliers are problematic for many statistical analyses because they can cause tests to either miss significant findings or distort real results.

Outliers pull the mean towards themselves: unusually large values increase the mean, while outliers to the left of the graph drag it down. The mean then no longer provides a good representation of the data, and we would rather use the median, which is much less affected by outliers.

In the example below, one unusual score of a student has a large impact on the mean for the entire dataset.

The range (the difference between the maximum and minimum values) is the simplest measure of spread. But if there is an outlier in the data, it will be the minimum or maximum value. Thus, the range is not robust to outliers. Neither the standard deviation nor the variance is robust to outliers. A data value that is separate from the body of the data can increase the value of the statistics by an arbitrarily large amount.

The interquartile range (IQR) is the difference between the 75th and 25th percentile of the data. Since only the middle 50% of the data affects this measure, it is robust to outliers.

In [13]:
import numpy as np

In [15]:
# student's score
score = [0,88,90,92,94,95,95,96,97,99]
np.mean(score)

84.6

In [16]:
#Removing the outlier : 0

score = [88,90,92,94,95,95,96,97,99]
np.mean(score)

94.0
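
The same scores can be used to compare how the other measures react to the outlier. This is a sketch. The helper `iqr` is an illustrative name, and it relies on `np.percentile`, which interpolates between values by default.

```python
import numpy as np

with_outlier = [0, 88, 90, 92, 94, 95, 95, 96, 97, 99]
without_outlier = with_outlier[1:]  # same scores with the 0 removed

def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 - q1

for label, scores in [("with outlier", with_outlier),
                      ("without outlier", without_outlier)]:
    print(label,
          "| median:", np.median(scores),
          "| range:", max(scores) - min(scores),
          "| std:", round(float(np.std(scores)), 2),
          "| IQR:", iqr(scores))
```

The median and the IQR change only slightly when the outlier is removed, while the range and the standard deviation change substantially.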