# Understanding Descriptive Statistics

Import the necessary libraries here:

In [1]:
# Libraries

import pandas as pd
import numpy as np
from scipy import stats

# Show all columns in pandas
pd.set_option('display.max_columns', None)

# Remove warnings (not necessary)
import warnings
warnings.filterwarnings('ignore')

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
%matplotlib inline

In [2]:
import random

## Challenge 1
#### 1.- Define a function that simulates rolling a dice 10 times. Save the information in a dataframe.
**Hint**: you can use the *choices* function from module *random* to help you with the simulation.

In [3]:
# your code here
dice = [1, 2, 3, 4, 5, 6]
dice_roll = random.choices(dice, k = 10)
dice_df = pd.DataFrame(dice_roll, columns=['result'])
dice_df

Unnamed: 0,result
0,3
1,2
2,3
3,2
4,1
5,2
6,2
7,2
8,6
9,5


#### 2.- Plot the results sorted by value.

In [4]:
dice_df.value_counts().sort_index(ascending=True)

result
1         1
2         5
3         2
5         1
6         1
Name: count, dtype: int64

#### 3.- Calculate the frequency distribution and plot it. What is the relation between this plot and the plot above? Describe it with words.

In [5]:
# your code here
px.histogram(dice_df)

In [6]:
"""
your comments here
"""
# The plot above lists the results from the dice rolls as a table whereas this plot visualizes the distribution of the dice rolls

'\nyour comments here\n'

## Challenge 2
Now, using the dice results obtained in *challenge 1*, your are going to define some functions that will help you calculate the mean of your data in two different ways, the median and the four quartiles. 

#### 1.- Define a function that computes the mean by summing all the observations and dividing by the total number of observations. You are not allowed to use any methods or functions that directly calculate the mean value. 

In [7]:
# your code here
def mean(list):
    return sum(list) // len(list)

#### 2.- First, calculate the frequency distribution. Then, calculate the mean using the values of the frequency distribution you've just computed. You are not allowed to use any methods or functions that directly calculate the mean value. 

In [44]:
# your code here
sorted_results = dice_df.sort_values(by='result', ascending=True, inplace=True)
sorted_results

In [45]:
sorted_df = dice_df.values.tolist()

In [46]:
sorted_df

[[1], [2], [2], [2], [2], [2], [3], [3], [5], [6]]

#### 3.- Define a function to calculate the median. You are not allowed to use any methods or functions that directly calculate the median value. 
**Hint**: you might need to define two computation cases depending on the number of observations used to calculate the median.

In [47]:
# your code here
def median(list):
    if len(list) % 2 == 0:
        return list[len(list) // 2]
    else:
        return (list[len(list) // 2 + .5] + list[len(list) // 2 - .5]) // 2

median(sorted_df)

[2]

#### 4.- Define a function to calculate the four quartiles. You can use the function you defined above to compute the median but you are not allowed to use any methods or functions that directly calculate the quartiles. 

In [11]:
import math

In [48]:
# your code here
def quartiles(list):
    if len(list) % 2 == 0:
        fifty = list[len(list) // 2]
    else:
        fifty = (list[len(list) // 2 + .5] + list[len(list) // 2 - .5]) // 2
    if len(list) % 4 == 0:
        twentyfive = list[len(list) // 4]
        seventyfive = list[(len(list) // 4) * 3]
    else:
        twentyfive = (list[math.ceil(len(list) // 4)])
        seventyfive = (list[math.ceil((len(list) // 4) * 3)])
    quartiles_results = {'0%': min(list), '25%': twentyfive, '50%': fifty, '75%': seventyfive, '100%': max(list)}
    return quartiles_results

In [49]:
quartiles(sorted_df)

{'0%': [1], '25%': [2], '50%': [2], '75%': [3], '100%': [6]}

## Challenge 3
Read the csv `roll_the_dice_hundred.csv` from the `data` folder.
#### 1.- Sort the values and plot them. What do you see?

In [50]:
# your code here
hundred_dice = pd.read_csv('roll_the_dice_hundred.csv')
hundred_dice

Unnamed: 0.1,Unnamed: 0,roll,value
0,0,0,1
1,1,1,2
2,2,2,6
3,3,3,1
4,4,4,6
...,...,...,...
95,95,95,4
96,96,96,6
97,97,97,1
98,98,98,3


In [15]:
"""
your comments here
"""
# A randomized list of 100 dice rolls

'\nyour comments here\n'

#### 2.- Using the functions you defined in *challenge 2*, calculate the mean value of the hundred dice rolls.

In [51]:
# your code here
mean(hundred_dice['value'])

3

#### 3.- Now, calculate the frequency distribution.


In [52]:
# your code here
hundred_dice['value'].value_counts()

value
6    23
4    22
2    17
3    14
1    12
5    12
Name: count, dtype: int64

#### 4.- Plot the histogram. What do you see (shape, values...) ? How can you connect the mean value to the histogram? 

In [53]:
# your code here
px.histogram(hundred_dice['value'])

In [19]:
"""
your comments here
"""

'\nyour comments here\n'

#### 5.- Read the `roll_the_dice_thousand.csv` from the `data` folder. Plot the frequency distribution as you did before. Has anything changed? Why do you think it changed?

In [54]:
# your code here
thousand_dice = pd.read_csv('roll_the_dice_thousand.csv')
thousand_dice['value'].value_counts()

value
1    175
3    175
4    168
2    167
6    166
5    149
Name: count, dtype: int64

In [55]:
px.histogram(thousand_dice['value'])

In [22]:
"""
your comments here
"""
# The differences between the value counts appear smaller for the thousand as opposed to hundred rolls

'\nyour comments here\n'

## Challenge 4
In the `data` folder of this repository you will find three different files with the prefix `ages_population`. These files contain information about a poll answered by a thousand people regarding their age. Each file corresponds to the poll answers in different neighbourhoods of Barcelona.

#### 1.- Read the file `ages_population.csv`. Calculate the frequency distribution and plot it as we did during the lesson. Try to guess the range in which the mean and the standard deviation will be by looking at the plot. 

In [56]:
# your code here
population_ages = pd.read_csv('ages_population.csv')
px.histogram(population_ages)

# I'm guessing the mean will be 40. It looks relatively evenly skewed with a std of around 10

#### 2.- Calculate the exact mean and standard deviation and compare them with your guesses. Do they fall inside the ranges you guessed?

In [24]:
# your code here
print(mean(population_ages.values))
print(np.std(population_ages.values))

[36.]
12.810089773299795


In [25]:
"""
your comments here
"""
# The mean was a bit lower than I estimated and the std was a bit highger

'\nyour comments here\n'

#### 3.- Now read the file `ages_population2.csv` . Calculate the frequency distribution and plot it.

In [26]:
# your code here
population_ages_2 = pd.read_csv('ages_population2.csv')
px.histogram(population_ages_2)

####  4.- What do you see? Is there any difference with the frequency distribution in step 1?

In [27]:
"""
your comments here
"""
# There is a smaller range of ages here

'\nyour comments here\n'

#### 5.- Calculate the mean and standard deviation. Compare the results with the mean and standard deviation in step 2. What do you think?

In [28]:
# your code here
print(mean(population_ages_2.values))
print(np.std(population_ages_2.values))

[27.]
2.9683286543103677


In [29]:
"""
your comments here
"""
# The mean is lower than the first dataset and the standard deviation is much smaller, likely because the range is smaller

'\nyour comments here\n'

## Challenge 5
Now is the turn of `ages_population3.csv`.

#### 1.- Read the file `ages_population3.csv`. Calculate the frequency distribution and plot it.

In [30]:
# your code here
population_ages_3 = pd.read_csv('ages_population3.csv')
px.histogram(population_ages_3)

#### 2.- Calculate the mean and standard deviation. Compare the results with the plot in step 1. What is happening?

In [31]:
# your code here
print(mean(population_ages_3.values))
print(np.std(population_ages_3.values))

[41.]
16.13663158778808


In [32]:
"""
your comments here
"""

'\nyour comments here\n'

#### 3.- Calculate the four quartiles. Use the results to explain your reasoning for question in step 2. How much of a difference is there between the median and the mean?

In [72]:
mean(population_ages_3.values)

array([41.])

In [68]:
# your code here
quartiles(population_ages_3.values)

{'0%': array([1.]),
 '25%': array([30.]),
 '50%': array([40.]),
 '75%': array([53.]),
 '100%': array([77.])}

In [34]:
"""
your comments here
"""
# There's only a 1 year difference between median and mean, meaning the population is very evenly skewed

'\nyour comments here\n'

#### 4.- Calculate other percentiles that might be useful to give more arguments to your reasoning.

In [35]:
# your code here

In [36]:
"""
your comments here
"""

'\nyour comments here\n'

## Bonus challenge
Compare the information about the three neighbourhoods. Prepare a report about the three of them. Remember to find out which are their similarities and their differences backing your arguments in basic statistics.

In [37]:
# your code here

In [38]:
"""
your comments here
"""

'\nyour comments here\n'