## III. Describing Data with Python

### Finding the Mean

#### The mean is a common and intuitive way to summarize a set of numbers. It's what we might simply call the "average" in everyday use, although as we'll see, there are other kinds of averages as well. Let's take a sample set of numbers and calculate the mean.

In [3]:
"""
Calculating the mean
"""

def calculate_mean(numbers):
    s = sum(numbers)
    N = len(numbers)
    # Calculate the mean
    mean = s / N
    
    return mean

if __name__ == '__main__':
    donations = [100, 60, 70, 900, 100, 200, 500, 500, 503, 600, 1000, 1200]
    mean = calculate_mean(donations)
    N = len(donations)
    print('Mean donation over the last {0} days is {1}'.format(N, mean))


Mean donation over the last 12 days is 477.75


### Finding the Median

#### The median of a collection of numbers is another kind of average. To find the median, we sort the numbers in ascending order. If the length of the list of numbers is odd, the number in the middle of the list is the median. If the length of the list of numbers is even, we get the median by taking the mean of the two middle numbers. Let's find the median of the previous list of donations: 100, 60, 70, 900, 100, 200, 500, 500, 503, 600, 1000, and 1200.

#### Before we write a program to find the median of a list of numbers, let's think about how we could automatically calculate the middle elements of a list in either case. If the length of a list (N) is odd, the middle number is the one in position (N+1)/2. If N is even, the two middle elements are in positions N/2 and (N/2) + 1. These positions count from 1, whereas Python list indexes start at 0, so the program subtracts 1 from each position before indexing.
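
#### As a quick check of the position rule above, the following sketch computes the middle position(s) for an odd-length and an even-length list. The helper name `middle_positions` is made up for this illustration and is not part of the programs in this section. It assumes the list has at least one element and returns positions that count from 1.

```python
def middle_positions(n):
    """Return the 1-based middle position(s) of a sorted list of length n."""
    if n % 2 == 1:
        # Odd length: a single middle element at (n + 1) / 2
        return [(n + 1) // 2]
    # Even length: two middle elements at n / 2 and (n / 2) + 1
    return [n // 2, n // 2 + 1]


print(middle_positions(5))   # [3]
print(middle_positions(12))  # [6, 7]
```

#### Integer division (`//`) keeps the results as integers, so they can be turned into list indexes by subtracting 1 without a separate call to `int()`.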

#### In order to write a function that calculates the median, we'll also need to sort a list in ascending order.

In [1]:
samplelist = [4, 1, 3]
samplelist.sort()
samplelist

[1, 3, 4]

In [3]:
"""
Calculating the median
"""

def calculate_median(numbers):
    N = len(numbers)
    numbers.sort()
    
    # Find the median
    if N % 2 == 0:      # if N is even
        m1 = N/2
        m2 = (N/2) + 1
        # Convert to integers and subtract 1 to get 0-based indexes (N/2 is a float, and lists do not accept floats as indexes)
        m1 = int(m1) - 1
        m2 = int(m2) - 1
        median = (numbers[m1] + numbers[m2])/2   
        
    else:
        m = (N+1) / 2
        # Convert to integer, match position
        m = int(m) - 1
        median = numbers[m]
             
    return median

if __name__ == '__main__':
    donations = [100, 60, 70, 900, 100, 200, 500, 500, 503, 600, 1000, 1200]
    median = calculate_median(donations)
    N = len(donations)
    print('Median donation over the last {0} days is {1}'.format(N, median))

Median donation over the last 12 days is 500.0
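
#### One way to double-check the two results so far is to compare them with Python's standard `statistics` module. This is only a cross-check under the assumption that the same `donations` list is used; it is not a replacement for the functions written above.

```python
import statistics

donations = [100, 60, 70, 900, 100, 200, 500, 500, 503, 600, 1000, 1200]

# Both functions leave the input list unchanged
mean = statistics.mean(donations)
median = statistics.median(donations)

print('Mean: {0}'.format(mean))      # Mean: 477.75
print('Median: {0}'.format(median))  # Median: 500.0
```

#### Note one difference in behavior: `calculate_median()` above calls `numbers.sort()`, which reorders the caller's list in place, whereas `statistics.median()` works on a sorted copy.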


### Finding the Mode and Creating a Frequency Table

#### Instead of finding the mean value or the median value of a set of numbers, what if you wanted to find the number that occurs most frequently? This number is called the mode.

#### There's no symbolic formula for calculating the mode; you simply count how many times each unique number occurs and find the one that occurs the most.
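
#### The counting step can be sketched by hand with a plain dictionary. The function name `count_occurrences` is made up for this illustration; the `Counter` class used in the rest of this section does the same bookkeeping for us.

```python
def count_occurrences(numbers):
    """Map each unique number to how many times it occurs."""
    counts = {}
    for num in numbers:
        counts[num] = counts.get(num, 0) + 1
    return counts


simplelist = [4, 2, 1, 3, 4]
counts = count_occurrences(simplelist)
# Pick the number with the largest count
mode = max(counts, key=counts.get)

print(counts)  # {4: 2, 2: 1, 1: 1, 3: 1}
print(mode)    # 4
```

#### If several numbers share the largest count, `max()` returns only the first one it encounters, so this sketch reports a single mode.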

In [13]:
from collections import Counter
simplelist = [4, 2, 1, 3, 4]
c = Counter(simplelist)
c.most_common()

[(4, 2), (2, 1), (1, 1), (3, 1)]

#### The first element of the first tuple is the number that occurs most frequently, and the second element is the number of times it occurs. The second, third, and fourth tuples contain the other numbers along with the count of the number of times they appear.

In [7]:
c.most_common(2)

[(4, 2), (2, 1)]

In [8]:
mode = c.most_common(1)
mode

[(4, 2)]

In [9]:
mode[0]

(4, 2)

In [10]:
mode[0][0]

4

### Finding the Mode

In [14]:
"""
Calculating the mode
"""

from collections import Counter
def calculate_mode(numbers):
    c = Counter(numbers)
    mode = c.most_common(1)
    return mode[0][0]

if __name__ == '__main__':
    scores = [7, 8, 9, 2, 10, 9, 9, 9, 9, 4, 5, 6, 1, 5, 6, 7, 8, 6, 1, 10]
    mode = calculate_mode(scores)
    print('The mode of the list of numbers is : {0}'.format(mode))

The mode of the list of numbers is : 9


#### What if you have a set of data where two or more numbers occur the same maximum number of times? For example, in the list of numbers 5, 5, 5, 4, 4, 4, 9, 1, and 3, both 4 and 5 are present three times. In such cases, the list of numbers is said to have multiple modes, and our program should find and print all the modes. The modified program follows:

In [15]:
"""
Calculating the mode when the list of numbers may
have multiple modes
"""

from collections import Counter

def calculate_mode(numbers):
    c = Counter(numbers)
    numbers_freq = c.most_common()
    max_count = numbers_freq[0][1]  # count of the most frequently occurring number
    
    modes = []
    for num in numbers_freq:
        if num[1] == max_count:  # this number occurs as often as the most frequent one
            modes.append(num[0])
    return modes

if __name__ == '__main__':
    scores = [5, 5, 5, 4, 4, 4, 9, 1, 3]
    modes = calculate_mode(scores)
    print('The mode(s) of the list of numbers are: ')
    for mode in modes:
        print(mode)

The mode(s) of the list of numbers are: 
5
4


### Creating a Frequency Table

In [1]:
"""
Frequency table for a list of numbers    
"""

from collections import Counter

def frequency_table(numbers):
    table = Counter(numbers)
    print("Number\tFrequency")
    for number in table.most_common():
        print('{0}\t{1}'.format(number[0], number[1]))
        
if __name__ == '__main__':
    scores = [7, 8, 9, 2, 10, 9, 9, 9, 9, 4, 5, 6, 1, 5, 6, 7, 8, 6, 1, 10]
    frequency_table(scores)

Number	Frequency
9	5
6	3
7	2
8	2
10	2
5	2
1	2
2	1
4	1


In [4]:
"""
Frequency table for a list of numbers 
Enhanced to display the table sorted by the numbers
"""
from collections import Counter

def freq_table(numbers):
    table = Counter(numbers)
    print('Numbers\tFrequency')
    numbers_freq = table.most_common()
    numbers_freq.sort()
    for number in numbers_freq:
        print('{0}\t{1}'.format(number[0], number[1]))
        
if __name__ == '__main__':
    scores = [7, 8, 9, 2, 10, 9, 9, 9, 9, 4, 5, 6, 1, 5, 6, 7, 8, 6, 1, 10]
    freq_table(scores)

Numbers	Frequency
1	2
2	1
4	1
5	2
6	3
7	2
8	2
9	5
10	2


### Measuring the Dispersion

#### The next statistical calculations we'll look at measure the dispersion, which tells us how far away the numbers in a set of data are from the mean of the data set.

### Finding the Range of a Set of Numbers

#### Once again, consider the list of donations during period A: 100, 60, 70, 900, 100, 200, 500, 500, 503, 600, 1000, and 1200. We found that the mean donation per day is 477.75. But just looking at the mean, we have no idea whether all the donations fell into a narrow range, say between 400 and 500, or whether they varied much more than that, say between 60 and 1200, as in this case. For a list of numbers, the range is the difference between the highest number and the lowest number. You could have two groups of numbers with the exact same mean but with vastly different ranges, so knowing the range fills in more information about a set of numbers beyond what we can learn from just looking at the mean, median, and mode.

In [5]:
"""
Find the range
"""

def find_range(numbers):
    lowest = min(numbers)
    highest = max(numbers)
    # Find the range
    r = highest - lowest
    
    return lowest, highest, r

if __name__ == '__main__':
    donations = [100, 60, 70, 900, 100, 200, 500, 500, 503, 600, 1000, 1200]
    lowest, highest, r = find_range(donations)
    print('Lowest: {0} Highest: {1} Range: {2}'.format(lowest, highest, r))
    
    

Lowest: 60 Highest: 1200 Range: 1140
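
#### To illustrate the point that two groups can share a mean yet differ widely in range, here is a small sketch. The two lists are made-up values chosen so that both have a mean of 450, and the helper names `mean_of` and `spread` are hypothetical.

```python
def mean_of(numbers):
    return sum(numbers) / len(numbers)


def spread(numbers):
    # Range: highest value minus lowest value
    return max(numbers) - min(numbers)


# Made-up data: both groups have the same mean
narrow = [400, 450, 500]
wide = [50, 450, 850]

for group in (narrow, wide):
    print('Mean: {0} Range: {1}'.format(mean_of(group), spread(group)))

# Mean: 450.0 Range: 100
# Mean: 450.0 Range: 800
```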


### Finding the Variance and the Standard Deviation

#### Now what do we do if we want more specific information about how all of the individual numbers vary from the mean? Were they all similar, clustered near the mean, or were they all different, closer to the extremes?

#### There are two related measures of dispersion that tell us more about a list of numbers along these lines: the variance and the standard deviation. To calculate either of these, we first need to find the difference of each of the numbers from the mean. The variance is the average of the squares of those differences. A high variance means that values are far from the mean; a low variance means that the values are clustered close to the mean. We calculate the variance using the formula $$\text{variance} = \frac{\sum (x_i - x_{mean})^2}{n}$$

#### In the formula, $x_i$ stands for individual numbers (in this case, daily total donations), $x_{mean}$ stands for the mean of these numbers (the mean daily donation), and $n$ is the number of values in the list (the number of days on which donations were received). For each value in the list, we take the difference between that number and the mean and square it. Then, we add all those squared differences together and, finally, divide the whole sum by $n$ to find the variance.

#### If we want to calculate the standard deviation as well, all we have to do is take the square root of the variance. Values that are within one standard deviation of the mean can be thought of as fairly typical, whereas values that are three or more standard deviations away from the mean can be considered much more atypical; we call such values outliers.
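
#### Before applying the formula to the donations, it may help to trace it on a small list where the arithmetic is easy to follow. The numbers below are made up for this illustration and were picked so that the mean, variance, and standard deviation all come out as round values.

```python
numbers = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample data
n = len(numbers)

mean = sum(numbers) / n                             # 40 / 8 = 5.0
squared_diff = [(x - mean) ** 2 for x in numbers]   # squared distance of each value from the mean
variance = sum(squared_diff) / n                    # 32.0 / 8 = 4.0
std = variance ** 0.5                               # square root of the variance

# Values within one standard deviation of the mean
typical = [x for x in numbers if abs(x - mean) <= std]

print(squared_diff)   # [9.0, 1.0, 1.0, 1.0, 0.0, 0.0, 4.0, 16.0]
print(variance, std)  # 4.0 2.0
print(typical)        # [4, 4, 4, 5, 5, 7]
```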

In [1]:
"""
Find the variance and standard deviation of a list of numbers
"""

def calculate_mean(numbers):
    s = sum(numbers)
    N = len(numbers)
    # calculate the mean
    mean = s/N
    
    return mean

def find_differences(numbers):
    # Find the mean
    mean = calculate_mean(numbers)
    # Find the differences from the mean
    diff = []
    for num in numbers:
        diff.append(num - mean)
        
    return diff

def calculate_variance(numbers):
    # Find the list of differences
    diff = find_differences(numbers)
    # Find the squared differences
    squared_diff = []
    for d in diff:
        squared_diff.append(d**2)
    # Find the variance
    sum_squared_diff = sum(squared_diff)
    variance = sum_squared_diff/len(numbers)
    
    return variance

if __name__ == '__main__':
    donations = [100, 60, 70, 900, 100, 200, 500, 500, 503, 600, 1000, 1200]
    variance = calculate_variance(donations)
    print('The variance of the list of numbers is {0}'.format(variance))
    
    std = variance**0.5
    print('The standard deviation of the list of numbers is {0}'.format(std))

The variance of the list of numbers is 141047.35416666666
The standard deviation of the list of numbers is 375.5627166887931


In [2]:
def calculate_mean(numbers):
    s = sum(numbers)
    N = len(numbers)
    # Calculate the mean
    mean = s/N
    return mean

def find_difference(numbers):
    mean = calculate_mean(numbers)
    diff = []
    for n in numbers:
        diff.append(n - mean)
        
    return diff

def calculate_variance(numbers):
    diff = find_difference(numbers)
    # Find the squared difference
    squared_diff = []
    for d in diff:
        squared_diff.append(d**2)
        
    # The sum of differences
    sum_squared_diff = sum(squared_diff)
    variance = sum_squared_diff / len(numbers)
    
    return variance

if __name__ == '__main__':
    donations = [382, 389, 377, 397, 396, 368, 369, 392, 398, 367, 393, 396]
    variance = calculate_variance(donations)
    print('The variance of this set of data is {0}'.format(variance))
    
    std = variance**0.5
    print('The standard deviation of this set of data is {0}'.format(std))

The variance of this set of data is 135.38888888888889
The standard deviation of this set of data is 11.63567311713804
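
#### The two donation lists used above can also be compared side by side with the standard `statistics` module. This sketch assumes that `pvariance()` and `pstdev()` are the right counterparts because they divide by $n$, as the formula in this section does. The variable names `first_list` and `second_list` are made up here.

```python
import statistics

first_list = [100, 60, 70, 900, 100, 200, 500, 500, 503, 600, 1000, 1200]
second_list = [382, 389, 377, 397, 396, 368, 369, 392, 398, 367, 393, 396]

for name, data in (('first list', first_list), ('second list', second_list)):
    # Population variance and standard deviation (divide by n)
    variance = statistics.pvariance(data)
    std = statistics.pstdev(data)
    print('{0}: variance {1:.2f}, standard deviation {2:.2f}'.format(name, variance, std))

# first list: variance 141047.35, standard deviation 375.56
# second list: variance 135.39, standard deviation 11.64
```

#### The second list has a much smaller standard deviation, which matches what we see by eye: its values all lie close to one another, while the first list ranges from 60 to 1200.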


### Calculating the Correlation Between Two Data Sets

#### In this section, we'll learn how to calculate a statistical measure that tells us the nature and the strength of the relationship between two sets of numbers: the Pearson correlation coefficient, which I'll call simply the correlation coefficient. Its value always lies between -1 and 1. Note that this coefficient measures the strength of the linear relationship only; it does not capture nonlinear relationships.

#### A correlation coefficient of 0 indicates that there's no linear correlation between the two quantities. (Note that this does not mean the two quantities are independent of each other. There could still be a nonlinear relationship between them.)

#### A coefficient of 1 or close to 1 indicates that there's a strong positive linear correlation; a coefficient of exactly 1 is referred to as perfect positive correlation. Similarly, a correlation coefficient of -1 or close to -1 indicates a strong negative correlation, where -1 indicates a perfect negative correlation.
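
#### The three cases described above can be seen on small made-up data sets. The sketch below uses the standard definition of the Pearson coefficient, the sum of the products of paired deviations from the means divided by the square root of the product of the two sums of squared deviations. The function name `pearson_correlation` and the data are assumptions for this illustration only, and the function assumes two lists of equal length that are not constant.

```python
def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of the products of paired deviations from the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each list
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


x = [-2, -1, 0, 1, 2]
rising = [2 * a + 1 for a in x]   # y = 2x + 1, a straight line going up
falling = [-3 * a for a in x]     # y = -3x, a straight line going down
curved = [a ** 2 for a in x]      # y = x^2, fully determined by x but not linear

print(round(pearson_correlation(x, rising), 6))   # 1.0
print(round(pearson_correlation(x, falling), 6))  # -1.0
print(round(pearson_correlation(x, curved), 6))   # 0.0
```

#### The last case shows why a coefficient of 0 does not imply independence: every value in `curved` is computed directly from `x`, yet the linear correlation is 0.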

### Correlation and Causation

#### In statistics, you'll often come across the statement "correlation doesn't imply causation." This is a reminder that even if two sets of observations are strongly correlated with each other, that doesn't mean one variable causes the other. When two variables are strongly correlated, sometimes there's a third factor that influences both variables and explains the correlation. A classic example is the correlation between ice cream sales and crime rates: if you track both of these variables in a typical city, you're likely to find a correlation, but this does not mean that ice cream sales cause crime (or vice versa). Ice cream sales and crime are correlated because they both go up as the weather gets hotter during the summer. Of course, this does not mean that hot weather directly causes crime to go up either; there are more complicated causes behind that correlation as well.

### Calculating the Correlation Coefficient