## The Mean

The mean is a measure of central tendency obtained by dividing the sum of the values by the total number of values in the distribution.

In [1]:
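distribution = [0, 2, 3, 3, 3, 4, 13]
mean = sum(distribution) / len(distribution)

above = []  # distances of the values above the mean
below = []  # distances of the values below the mean

for value in distribution:
    if value < mean:
        below.append(mean - value)
    elif value > mean:
        above.append(value - mean)

# Check whether the total distance above equals the total distance below
equal_distances = (sum(above) == sum(below))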

In [2]:
equal_distances

True

From the result above, we can think of the mean as the balance point of the distribution: the total distance of the values below the mean is the same as the total distance of the values above the mean.
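Another way to see the same property is to keep the sign of each distance. The sketch below reuses the example distribution and shows that the signed deviations from the mean cancel out. The names `deviations`, `total_below`, and `total_above` are introduced here for illustration only.

```python
# Same example distribution as in the cell above.
distribution = [0, 2, 3, 3, 3, 4, 13]
mean = sum(distribution) / len(distribution)

# Signed deviations: negative below the mean, positive above it.
deviations = [value - mean for value in distribution]

total_below = -sum(d for d in deviations if d < 0)
total_above = sum(d for d in deviations if d > 0)

print(mean)                      # 4.0
print(total_below, total_above)  # 9.0 9.0
print(sum(deviations))           # 0.0
```

Because the distances above and below are equal, the signed deviations add up to zero.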

## Task

Generate 5000 different distributions, measure the total distances above and below the mean, and check whether they are equal. For each of the 5000 iterations of a for loop:

+ Set a seed using the `seed()` function from `numpy.random`. For the first iteration, the seed should be 0, for the second iteration 1, for the third 2, and so on.
+ Randomly generate a distribution of integers using the `randint()` function from `numpy.random`. Pass the right arguments to `randint()` so that each distribution will:
    + Have 10 values.
    + Contain values that range from 0 to 1000.
+ Compute the mean of the distribution.
+ Measure the total distance above and below the mean.
    + Round off each distance to 1 decimal place using the `round()` function. This will prevent rounding errors at the 13th or 14th decimal place.
+ Compare the two sums. If they are equal, increment a variable named `equal_distances` by 1. You'll need to define `equal_distances` outside the loop with a value of 0.

At the end, `equal_distances` should have a value of 5000. This will confirm that for each of the 5000 distributions, the total distance of the values above the mean is equal to the total distance of the values below the mean.


In [1]:
from numpy.random import randint, seed

equal_distances = 0

for i in range(5000):
    seed(i)
    # randint() excludes the upper bound, so the values fall between 0 and 999
    distribution = randint(0, 1000, 10)
    mean = sum(distribution) / len(distribution)

    above = []
    below = []
    for value in distribution:
        if value == mean:
            continue  # continue with the next iteration because the distance is 0
        if value < mean:
            below.append(mean - value)
        if value > mean:
            above.append(value - mean)

    # Round the totals to 1 decimal place to absorb floating-point rounding errors
    sum_above = round(sum(above), 1)
    sum_below = round(sum(below), 1)
    if sum_above == sum_below:
        equal_distances += 1

In [2]:
equal_distances

5000
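Rounding is one way to absorb floating-point noise. A tolerance-based comparison is another. The sketch below is a variant, not the solution to the task: it assumes the standard library `random` module instead of `numpy.random`, so the generated distributions differ from the ones above. The helper `is_balanced()` and the tolerance values are hypothetical choices made for this illustration.

```python
import math
import random


def is_balanced(distribution, rel_tol=1e-9, abs_tol=1e-9):
    """Return True if the total distances above and below the mean match."""
    mean = sum(distribution) / len(distribution)
    above = sum(value - mean for value in distribution if value > mean)
    below = sum(mean - value for value in distribution if value < mean)
    # Compare within a tolerance instead of rounding the totals.
    return math.isclose(above, below, rel_tol=rel_tol, abs_tol=abs_tol)


equal_count = 0
for i in range(5000):
    rng = random.Random(i)  # one seed per iteration: 0, 1, 2, ...
    # random.Random.randint() includes both bounds, unlike numpy's randint().
    distribution = [rng.randint(0, 1000) for _ in range(10)]
    if is_balanced(distribution):
        equal_count += 1

print(equal_count)  # expected: 5000
```

The `abs_tol` argument matters when both totals are 0 or very small, where a purely relative tolerance would be too strict.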

Creating a function for calculating the mean

In [3]:
distribution_1 = [42, 24, 32, 11]
distribution_2 = [102, 32, 74, 15, 38, 45, 22]
distribution_3 = [3, 12, 7, 2, 15, 1, 21]

def mean(distribution):
    sum_distribution = 0
    for value in distribution:
        sum_distribution += value
        
    return sum_distribution / len(distribution)

mean_1 = mean(distribution_1)
mean_2 = mean(distribution_2)
mean_3 = mean(distribution_3)
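As a quick sanity check, the hand-written function can be compared against `statistics.mean()` from the Python standard library. This is a minimal sketch that redefines `mean()` so it runs on its own. The dictionary `distributions` is introduced here only for the illustration.

```python
import math
import statistics


def mean(distribution):
    sum_distribution = 0
    for value in distribution:
        sum_distribution += value
    return sum_distribution / len(distribution)


distributions = {
    "distribution_1": [42, 24, 32, 11],
    "distribution_2": [102, 32, 74, 15, 38, 45, 22],
    "distribution_3": [3, 12, 7, 2, 15, 1, 21],
}

for name, values in distributions.items():
    ours = mean(values)
    reference = statistics.mean(values)
    # Compare within a tolerance rather than with ==.
    print(name, ours, math.isclose(ours, reference))
```

For the first distribution, the sum is 109 and there are 4 values, so both functions return 27.25.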

### Introducing the dataset

We'll be working with a data set that describes characteristics of houses sold between 2006 and 2010 in the city of Ames (located in the American state of Iowa). There are 2930 rows in the data set, and each row describes a house. For each house there are 82 characteristics described, which means there are 82 columns in the data set. 

The dataset is described in detail in this paper: [Ames Housing Dataset Details](data/Ames%20Iowa%20Alternative%20to%20the%20Boston%20Housing%20Data%20as%20an%20End%20of%20Semester%20Regression%20Project.pdf)

In [1]:
import pandas as pd

houses = pd.read_table('data/AmesHousing_1.txt')

Distribution of sales prices of houses

In [2]:
houses['SalePrice'].describe()

count      2930.000000
mean     180796.060068
std       79886.692357
min       12789.000000
25%      129500.000000
50%      160000.000000
75%      213500.000000
max      755000.000000
Name: SalePrice, dtype: float64

We can see that the distribution has a large range: the minimum sale price is $12,789 while the maximum is $755,000. Among this diversity of prices, the mean (or the "balance point") of this distribution is approximately $180,796. The mean gives us a sense of the typical sale price in this distribution of 2930 prices.

Calculating just the mean

In [3]:
houses['SalePrice'].mean()

180796.0600682594

Comparing the means calculated in two different ways

In [4]:
def mean(distribution):
    sum_distribution = 0
    for value in distribution:
        sum_distribution += value
        
    return sum_distribution / len(distribution)

function_mean = mean(houses['SalePrice'])
pandas_mean = houses['SalePrice'].mean()
means_are_equal = (function_mean == pandas_mean)
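Comparing two floating-point means with `==` only succeeds when both computations produce bit-identical results, which depends on how each routine accumulates the sum. The sketch below uses a small hypothetical list (not the Ames data) to show two summation strategies that disagree in the last digit, and how `math.isclose()` handles that case.

```python
import math

# Hypothetical values chosen to expose rounding; not taken from the data set.
values = [0.1] * 10

naive_mean = sum(values) / len(values)           # plain left-to-right summation
precise_mean = math.fsum(values) / len(values)   # error-compensated summation

print(naive_mean == precise_mean)              # False
print(math.isclose(naive_mean, precise_mean))  # True
```

When two means come from different implementations, a tolerance-based comparison is usually the safer habit.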

### Estimating population mean

In the exercise below, we'll try to visualize on a scatter plot how the sampling error changes as we increase the sample size. Just to prove a point, we'll assume that our data set describes all the houses sold in Ames, Iowa between 2006 and 2010.

#### Instructions



1. Compute the mean of the `SalePrice` variable. We'll assume that the data we have is a population relative to the question "What's the mean sale price of a house in Ames, Iowa for the period 2006-2010?".

2. For each iteration of a for loop that iterates 101 times:

    + Sample the `SalePrice` distribution using the `Series.sample()` method.
        + For the first iteration, the `random_state` parameter is 0, for the second iteration it is 1, for the third it is 2, and so on.
        + For the first iteration, the sample size is 5.
        + The last sample size is 2905 (which is close to 2930, the population's size).
        + To achieve that, you'll need to increment the sample size by 29 for every new iteration. Note that you'll first have to define the sample size with a value of 5 outside the loop.
    + Compute the sample mean.
    + Compute the sampling error. For answer-checking purposes, use `parameter - statistic`, not `statistic - parameter`.

3. Generate a scatter plot to represent visually how the sampling error changes as the sample size increases.

    + Place the sample sizes on the x-axis.
    + Place the sampling errors on the y-axis.
    + Use `plt.axhline()` to generate a horizontal line at 0 to illustrate the point where the sampling error is 0.
    + Use `plt.axvline()` to generate a vertical line at 2930 to illustrate the population size.
    + Label the x-axis "Sample size".
    + Label the y-axis "Sampling error".
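The sample-size schedule in step 2 can be checked with a few lines of arithmetic. This sketch builds the sizes with `range()` rather than incrementing a counter, which is an equivalent formulation chosen here for brevity.

```python
# Start at 5 and add 29 for each of the 101 iterations.
sample_sizes = list(range(5, 5 + 29 * 101, 29))

print(len(sample_sizes))                  # 101
print(sample_sizes[0], sample_sizes[-1])  # 5 2905
```

The last size is 5 + 29 × 100 = 2905, which stays below the population size of 2930.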

In [None]:
import matplotlib.pyplot as plt

# Treat the full data set as the population
parameter = houses['SalePrice'].mean()
sample_size = 5

sample_sizes = []
sampling_errors = []

for i in range(101):
    sample = houses['SalePrice'].sample(sample_size, random_state=i)
    statistic = sample.mean()
    sampling_error = parameter - statistic
    sampling_errors.append(sampling_error)
    sample_sizes.append(sample_size)
    sample_size += 29

plt.scatter(sample_sizes, sampling_errors)
plt.axhline(0)     # sampling error of 0
plt.axvline(2930)  # population size
plt.xlabel('Sample size')
plt.ylabel('Sampling error')
plt.show()
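The same idea can be tried without the data file. The sketch below is an illustration under stated assumptions: `population` is a hypothetical list of 2930 random integers standing in for the sale prices, and `sampling_error()` is a helper introduced here. It uses `random.sample()`, which draws without replacement.

```python
import random

# Hypothetical population of 2930 "prices"; these are NOT the Ames values.
rng = random.Random(0)
population = [rng.randint(50_000, 500_000) for _ in range(2930)]
parameter = sum(population) / len(population)


def sampling_error(sample_size, seed):
    """Return parameter - statistic for one random sample without replacement."""
    sample = random.Random(seed).sample(population, sample_size)
    statistic = sum(sample) / len(sample)
    return parameter - statistic


for size in (5, 295, 2905, 2930):
    print(size, round(sampling_error(size, seed=size), 2))
```

When the sample size equals the population size, the sample contains every value, so the sampling error is exactly 0. For smaller samples the error varies from seed to seed, but it tends to shrink as the sample grows.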