# Lab 2: Summary Statistics and Probabilities

## Reminders on using Google CoLab:

* Every time you reopen the lab, you need to upload the data file again
* Every time you reopen the lab, you need to run all cells again from the beginning
* If you need to add cells, hover your mouse between two cells and 
    * select `+ Code` to create a cell for your code
    * select `+ Text` to create a cell for discussion on the results, including your answers to the lab's questions 

You are welcome to save a copy of this project into your own Google folder.
* In the `File` menu, select `Save a copy in Drive`
* You can now edit the project and save your work

## Instructions

In Lab 2, you will analyze a dataset. Below, I have included code to help you find:
* Summary Statistics (Mean, Standard Deviation, and the 5-number Summary)
* Probability Distributions (counting each category then dividing by the total)

For the example code, I am using the [Duke Forest Housing Dataset](https://www.openintro.org/data/csv/duke_forest.csv) from our [textbook dataset website](https://www.openintro.org/data/). You will want to download the dataset and then load it into Google CoLab.

After completing the example code, your job will be to select a dataset from the [textbook dataset website](https://www.openintro.org/data/) and perform a similar analysis. You are welcome to use the same dataset that you used for Lab 1. Remember that *an analysis is not complete without appropriate graphs*. Go back to Lab 1 to remember how to make the different types of graphs.

## Load the Data

In [6]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [4]:
duke = pd.read_csv("../Datasets/duke_forest.csv")
duke.head()

,address,price,bed,bath,area,type,year_built,heating,cooling,parking,lot,hoa,url
0,"1 Learned Pl, Durham, NC 27705",1520000,3,4.0,6040,Single Family,1972,"Other, Gas",central,0 spaces,0.97,,https://www.zillow.com/homedetails/1-Learned-P...
1,"1616 Pinecrest Rd, Durham, NC 27705",1030000,5,4.0,4475,Single Family,1969,"Forced air, Gas",central,"Carport, Covered",1.38,,https://www.zillow.com/homedetails/1616-Pinecr...
2,"2418 Wrightwood Ave, Durham, NC 27705",420000,2,3.0,1745,Single Family,1959,"Forced air, Gas",central,"Garage - Attached, Covered",0.51,,https://www.zillow.com/homedetails/2418-Wright...
3,"2527 Sevier St, Durham, NC 27705",680000,4,3.0,2091,Single Family,1961,"Heat pump, Other, Electric, Gas",central,"Carport, Covered",0.84,,https://www.zillow.com/homedetails/2527-Sevier...
4,"2218 Myers St, Durham, NC 27707",428500,4,3.0,1772,Single Family,2020,"Forced air, Gas",central,0 spaces,0.16,,https://www.zillow.com/homedetails/2218-Myers-...


## Summary Statistics for Quantitative Variables

In [None]:
print("             Mean = ", duke['bath'].mean())
print("    Sample St Dev = ", duke['bath'].std())
print("Population St Dev = ", duke['bath'].std(ddof=0))
print("Min    = ", duke['bath'].min())
print("Q1     = ", duke['bath'].quantile(0.25))
print("Median = ", duke['bath'].median())
print("Q3     = ", duke['bath'].quantile(0.75))
print("Max    = ", duke['bath'].max())

             Mean =  3.107142857142857
    Sample St Dev =  0.9340356954701309
Population St Dev =  0.9292579879469777
Min    =  1.0
Q1     =  2.5
Median =  3.0
Q3     =  4.0
Max    =  6.0
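
To see where the two standard deviations come from, here is a minimal sketch that computes them by hand. It assumes the `bath` values can be rebuilt from the column totals of the bed/bath table in the 2-Variable Proportions section; the names `bath_counts`, `values`, and `sum_sq_dev` are my own and are only for illustration. The only difference between the two results is dividing by `n - 1` (sample) or by `n` (population).

```python
import math

# Bath counts read off the column totals of the bed/bath table in this lab.
# (Assumption for illustration: in the notebook you would use duke['bath'] directly.)
bath_counts = {1.0: 3, 2.0: 18, 2.5: 6, 3.0: 41, 4.0: 23, 4.5: 1, 5.0: 5, 6.0: 1}
values = [value for value, count in bath_counts.items() for _ in range(count)]

n = len(values)
mean = sum(values) / n

# Sum of squared deviations from the mean
sum_sq_dev = sum((x - mean) ** 2 for x in values)

# Sample standard deviation divides by n - 1 (same as .std())
sample_sd = math.sqrt(sum_sq_dev / (n - 1))

# Population standard deviation divides by n (same as .std(ddof=0))
population_sd = math.sqrt(sum_sq_dev / n)

print("Sample St Dev     =", sample_sd)
print("Population St Dev =", population_sd)
```

The two printed values should agree with the `.std()` and `.std(ddof=0)` results above, up to rounding.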


In [58]:
print(duke.describe())

              price        bed       bath         area   year_built        lot
count  9.800000e+01  98.000000  98.000000    98.000000    98.000000  97.000000
mean   5.598987e+05   3.744898   3.107143  2779.265306  1967.051020   0.571134
std    2.254481e+05   0.750412   0.934036   943.205300    18.393955   0.219155
min    9.500000e+04   2.000000   1.000000  1094.000000  1923.000000   0.150000
25%    4.506250e+05   3.000000   2.500000  2132.750000  1956.250000   0.450000
50%    5.400000e+05   4.000000   3.000000  2623.000000  1962.000000   0.550000
75%    6.437500e+05   4.000000   4.000000  3253.750000  1972.000000   0.650000
max    1.520000e+06   6.000000   6.000000  6178.000000  2020.000000   1.470000
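
The quartiles can also be combined into the interquartile range (IQR) and used to check for outliers. The sketch below uses the common 1.5 × IQR rule, which is my assumption about how to flag outliers (it is the same rule a standard boxplot uses for its whiskers). As before, the `bath` values are rebuilt from the column totals of the bed/bath table in the 2-Variable Proportions section so that the example runs on its own.

```python
import pandas as pd

# Bath counts read off the column totals of the bed/bath table in this lab.
# (Assumption for illustration: in the notebook you would use bath = duke['bath'].)
bath_counts = {1.0: 3, 2.0: 18, 2.5: 6, 3.0: 41, 4.0: 23, 4.5: 1, 5.0: 5, 6.0: 1}
bath = pd.Series(
    [value for value, count in bath_counts.items() for _ in range(count)],
    name="bath",
)

q1 = bath.quantile(0.25)
q3 = bath.quantile(0.75)
iqr = q3 - q1  # width of the middle 50% of the data

# 1.5 * IQR rule: anything beyond these fences is flagged as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = bath[(bath < lower_fence) | (bath > upper_fence)]

print("IQR =", iqr)
print("Fences =", lower_fence, "to", upper_fence)
print("Number of outliers =", len(outliers))
```

For your own dataset, replace `bath` with the column you are analyzing and compare the flagged values against the points drawn beyond the whiskers of your boxplot.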


## Proportions

### 1-Variable Proportions

In [None]:
# Let's get a count of each category in the cooling variable (a categorical variable)
cooling = duke['cooling'].value_counts()
cooling

cooling
other      53
central    45
Name: count, dtype: int64

In [None]:
# Turn counts into proportions by dividing by the total
cooling_proportions = cooling / cooling.sum()
cooling_proportions

cooling
other      0.540816
central    0.459184
Name: count, dtype: float64
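
As a shortcut, pandas can produce the proportions in one step with `value_counts(normalize=True)`. The sketch below rebuilds the `cooling` column from the counts shown above (53 `other`, 45 `central`) so it runs without the data file; in the notebook you would call it on `duke['cooling']` directly.

```python
import pandas as pd

# Rebuild the cooling column from the counts shown above.
# (Assumption for illustration: in the notebook, use duke['cooling'] instead.)
cooling_values = pd.Series(["other"] * 53 + ["central"] * 45, name="cooling")

# normalize=True divides each count by the total for us
cooling_shortcut = cooling_values.value_counts(normalize=True)
print(cooling_shortcut)

# Multiply by 100 to report percentages instead of proportions
print((cooling_shortcut * 100).round(1))
```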

### 2-Variable Proportions

In [None]:
# Find the counts for each combination of bed and bath (2 categorical variables)
# Note that 'bed' and 'bath' are quantitative, but we can also treat them as categorical variables
bed_bath = pd.crosstab(duke['bed'], duke['bath'])

# Let's see the table
print(bed_bath)

bath  1.0  2.0  2.5  3.0  4.0  4.5  5.0  6.0
bed                                         
2       1    2    0    1    0    0    0    0
3       2   12    3   11    2    0    0    0
4       0    4    3   26   16    1    2    0
5       0    0    0    3    5    0    2    1
6       0    0    0    0    0    0    1    0


In [None]:
# Use display() to format the output nicely
display(bed_bath)

bath,1.0,2.0,2.5,3.0,4.0,4.5,5.0,6.0
bed,,,,,,,,
2,1,2,0,1,0,0,0,0
3,2,12,3,11,2,0,0,0
4,0,4,3,26,16,1,2,0
5,0,0,0,3,5,0,2,1
6,0,0,0,0,0,0,1,0


In [None]:
# We need to find the total number of houses in all the categories
bed_bath.sum().sum()

np.int64(98)

In [None]:
# Now we turn our counts into proportions by dividing by the total
display(bed_bath / bed_bath.sum().sum())

bath,1.0,2.0,2.5,3.0,4.0,4.5,5.0,6.0
bed,,,,,,,,
2,0.010204,0.020408,0.0,0.010204,0.0,0.0,0.0,0.0
3,0.020408,0.122449,0.030612,0.112245,0.020408,0.0,0.0,0.0
4,0.0,0.040816,0.030612,0.265306,0.163265,0.010204,0.020408,0.0
5,0.0,0.0,0.0,0.030612,0.05102,0.0,0.020408,0.010204
6,0.0,0.0,0.0,0.0,0.0,0.0,0.010204,0.0
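
`pd.crosstab` can also do the division for us through its `normalize` argument. The sketch below is a self-contained illustration: it rebuilds one row per house from the bed/bath counts shown above (the names `counts` and `houses` are my own), then uses `normalize="all"` to get the same table of proportions. In the notebook you would pass `duke['bed']` and `duke['bath']` directly.

```python
import pandas as pd

# Counts copied from the bed/bath table above
counts = pd.DataFrame(
    [
        [1, 2, 0, 1, 0, 0, 0, 0],
        [2, 12, 3, 11, 2, 0, 0, 0],
        [0, 4, 3, 26, 16, 1, 2, 0],
        [0, 0, 0, 3, 5, 0, 2, 1],
        [0, 0, 0, 0, 0, 0, 1, 0],
    ],
    index=pd.Index([2, 3, 4, 5, 6], name="bed"),
    columns=pd.Index([1.0, 2.0, 2.5, 3.0, 4.0, 4.5, 5.0, 6.0], name="bath"),
)

# Rebuild one (bed, bath) row per house so crosstab has raw data to count.
# (Assumption for illustration: in the notebook, use the duke DataFrame instead.)
rows = [
    (bed, bath)
    for bed in counts.index
    for bath in counts.columns
    for _ in range(int(counts.loc[bed, bath]))
]
houses = pd.DataFrame(rows, columns=["bed", "bath"])

# normalize="all" divides every cell by the grand total
proportions = pd.crosstab(houses["bed"], houses["bath"], normalize="all")
print(proportions.round(6))

# Summing across each row gives the proportion of houses with each bed count
bed_proportions = proportions.sum(axis=1)
print(bed_proportions.round(6))
```

Every cell in `proportions` should match the table above, and all of the cells together should add up to 1.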


-----
## Your turn to do the analysis

Your job now is to select a dataset from the [textbook dataset website](https://www.openintro.org/data/) and perform a similar analysis. You are welcome to use the same dataset that you used for Lab 1. Remember that *an analysis is not complete without appropriate graphs*. Go back to Lab 1 to remember how to make the different types of graphs.

Once your analysis is complete, answer the following questions:
* Questions about each of your quantitative variables:
    1. For each quantitative variable, determine whether the distribution looks normal, bimodal, or uniform.
    2. How spread out is the data? Is the data tightly clustered around the mean or widely dispersed? Discuss why you chose your answer.
    3. For each quantitative variable, use a boxplot and the 5-number summary to determine if the data is symmetrical or skewed. Identify the type of skew if applicable.
    4. What is the range of the middle 50% of the data? (That is, what is the interquartile range, or IQR?)
    5. Are there any unusually high or low values (outliers) in the dataset?
* Questions about each of your categorical variables:
    1. What's the most common category or outcome? What percentage of the data falls into this category?
    2. Are the categories evenly distributed? If not, which category is the least common?
    3. Do the proportions of the different categories suggest any interesting patterns or imbalances?

-----
To submit your assignment:
* Click on the `Share` button above
* Under General Access, select `Anyone with the link`
* Copy the link
* Paste the link into Canvas
* Click `Submit`