### Let's start off by exploring some data
- Load the file `class1.dataset1.csv` from Canvas
- For this notebook, you should be familiar with a few statistical ideas:
    - [Mean](https://en.wikipedia.org/wiki/Mean)
    - [Standard deviation](https://en.wikipedia.org/wiki/Standard_deviation)
    - [Correlation](https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html)

You can easily look up how to compute these statistics in Python using a package like numpy.
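As a quick reference, here is a minimal sketch of what the mean and standard deviation compute, checked against the NumPy functions. The `values` array is made up for illustration and is not part of the class data.

```python
import numpy as np

# Made-up values, purely for illustration
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

n = len(values)

# Mean: sum of the values divided by how many there are
mean_manual = values.sum() / n

# Population standard deviation: square root of the mean squared deviation
sd_manual = np.sqrt(((values - mean_manual) ** 2).sum() / n)

# The NumPy functions should agree with the manual calculations
print(mean_manual, np.mean(values))
print(sd_manual, np.std(values))
```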

In [2]:
import pandas as pd 
import numpy as np
df1 = pd.read_csv("class1.dataset1.csv")

In [11]:
df1

Unnamed: 0.1,Unnamed: 0,x,y
0,0,10.0,8.04
1,1,8.0,6.95
2,2,13.0,7.58
3,3,9.0,8.81
4,4,11.0,8.33
5,5,14.0,9.96
6,6,6.0,7.24
7,7,4.0,4.26
8,8,12.0,10.84
9,9,7.0,4.82


### Dataset 1

##### Compute the mean 

Using the software of your choice (numpy?), compute the mean of the `y` values in the first dataset.

In [6]:
np.mean(df1["y"])

7.500909090909093

- Now compute the mean of the `x` values in the dataset 

In [7]:
np.mean(df1["x"])

9.0

##### Compute the standard deviation 

- Compute the standard deviation of `x` and `y`

In [14]:
np.std(df1["x"])

3.1622776601683795

In [10]:
np.std(df1["y"])

1.937024215108669
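One detail worth keeping in mind: `np.std` defaults to the population standard deviation (`ddof=0`), while pandas' `Series.std()` defaults to the sample standard deviation (`ddof=1`). The sketch below uses made-up values to show the difference.

```python
import numpy as np
import pandas as pd

# Made-up values, purely for illustration
values = [1.0, 2.0, 3.0, 4.0, 5.0]

population_sd = np.std(np.array(values))       # divides by n (ddof=0)
sample_sd = np.std(np.array(values), ddof=1)   # divides by n - 1
pandas_sd = pd.Series(values).std()            # pandas defaults to ddof=1

print(population_sd, sample_sd, pandas_sd)
```

Whichever convention you pick, use the same one for every dataset you compare.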

##### Compute the correlation

- Compute the correlation of `x` and `y`

In [18]:
np.corrcoef(df1["x"],df1["y"])[0][1]

0.81642051634484
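The indexing above is needed because `np.corrcoef` returns a 2x2 correlation matrix rather than a single number. Here is a small sketch with made-up arrays that also checks the result against Pearson's formula.

```python
import numpy as np

# Made-up values, purely for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# 2x2 symmetric matrix: the diagonal is 1, the off-diagonal entries are r
corr_matrix = np.corrcoef(x, y)
r = corr_matrix[0, 1]

# Pearson's r from its definition: covariance scaled by both spreads
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(corr_matrix)
print(r, r_manual)
```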

### Generalize your code by writing a method

In [19]:
def summary_stats(file_name):
    '''
    Read in a file called `file_name` and return a dictionary of summary stats
    
    "mean_x": The mean of the x observations
    "mean_y": The mean of the y observations
    "sd_x": The standard deviation of the x observations
    "sd_y": The standard deviation of the y observations
    "corr": The correlation between x and y
    '''
    output = {"mean_x": None,
              "mean_y": None,
              "sd_x": None,
              "sd_y": None,
              "corr": None}
    df = pd.read_csv(file_name)

    # your code here
    output["mean_x"]=np.mean(df["x"])
    output["mean_y"]=np.mean(df["y"])
    output["sd_x"]=np.std(df["x"])
    output["sd_y"]=np.std(df["y"])
    output["corr"]=np.corrcoef(df["x"],df["y"])[0][1]
    
    return output

In [20]:
summary_stats("class1.dataset1.csv")

{'mean_x': 9.0,
 'mean_y': 7.500909090909093,
 'sd_x': 3.1622776601683795,
 'sd_y': 1.937024215108669,
 'corr': 0.81642051634484}

### Use your method to compute summary statistics for the four datasets

In [21]:
for i in range(1,5):
    file_name = "class1.dataset{}.csv".format(i)
    print(summary_stats(file_name))

{'mean_x': 9.0, 'mean_y': 7.500909090909093, 'sd_x': 3.1622776601683795, 'sd_y': 1.937024215108669, 'corr': 0.81642051634484}
{'mean_x': 9.0, 'mean_y': 7.500909090909091, 'sd_x': 3.1622776601683795, 'sd_y': 1.93710869148962, 'corr': 0.8162365060002428}
{'mean_x': 9.0, 'mean_y': 7.500000000000001, 'sd_x': 3.1622776601683795, 'sd_y': 1.9359329439927313, 'corr': 0.8162867394895984}
{'mean_x': 9.0, 'mean_y': 7.50090909090909, 'sd_x': 3.1622776601683795, 'sd_y': 1.9360806451340837, 'corr': -0.314046706495578}
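Printed dictionaries are hard to compare by eye. One option is to collect the summaries into a single table with one row per dataset. The sketch below assumes a hypothetical helper `summarize` that takes a DataFrame instead of a file name, and uses two made-up frames in place of the CSV files.

```python
import numpy as np
import pandas as pd

def summarize(df):
    """Return summary stats for a DataFrame with `x` and `y` columns (hypothetical helper)."""
    x = df["x"].to_numpy()
    y = df["y"].to_numpy()
    return {
        "mean_x": x.mean(),
        "mean_y": y.mean(),
        "sd_x": x.std(),  # population SD (ddof=0), matching np.std's default
        "sd_y": y.std(),
        "corr": np.corrcoef(x, y)[0, 1],
    }

# Made-up stand-ins for the class CSV files
frames = {
    "toy_a": pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 8.0]}),
    "toy_b": pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [8.0, 6.0, 4.0, 2.0]}),
}

# One row per dataset, one column per statistic, rounded for readability
table = pd.DataFrame({name: summarize(df) for name, df in frames.items()}).T.round(3)
print(table)
```

With the real files, the same pattern would apply to frames loaded with `pd.read_csv`.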


### Comment on your observations

Based on your code so far, do you think the datasets are identical? Can you spot any differences? Explain your reasoning.

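Based on the output above, the four datasets have nearly identical summary statistics: mean_x and sd_x are the same, and mean_y and sd_y agree to about two decimal places. The correlation is about 0.816 for datasets 1 through 3 but -0.314 for dataset 4, so dataset 4 clearly differs. Matching means and standard deviations alone do not show that the datasets themselves are identical.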
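Matching summary statistics do not guarantee matching data. As a small illustration with made-up numbers (not the class datasets), the two arrays below share the same mean and standard deviation yet contain different values.

```python
import numpy as np

# Made-up arrays: mirror images of each other around zero
a = np.array([-3.0, 1.0, 1.0, 1.0])
b = np.array([-1.0, -1.0, -1.0, 3.0])

same_mean = bool(np.isclose(a.mean(), b.mean()))
same_sd = bool(np.isclose(a.std(), b.std()))

# Compare the sorted values so that ordering does not matter
same_values = bool(np.array_equal(np.sort(a), np.sort(b)))

print(same_mean, same_sd, same_values)
```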

### Now let's make some scatter plots!

In [29]:
import altair as alt 

charts = [] 

for i in range(1, 5):

    c = alt.Chart(pd.read_csv("class1.dataset{}.csv".format(i))).mark_circle().encode(
        x='x',
        y='y'
    ).properties(
        height=100,
        width=100
    )
    
    charts.append(c)
    
charts[0] | charts[1] |  charts[2] |  charts[3]

### What do we observe?

Based on the plots above, the relationship between x and y is different in each dataset. In the first and last datasets the points are scattered, the second follows a curve (polynomial), and the third is a straight line apart from one outlying point at the top.
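A visual impression of curvature can be checked numerically by fitting a straight line and a quadratic and comparing the leftover error. This is only a sketch: the data are made up, and `rss` is a hypothetical helper rather than part of the assignment.

```python
import numpy as np

# Made-up curved data, purely for illustration
x = np.arange(4.0, 15.0)
y_curved = -0.1 * (x - 11.0) ** 2 + 9.0

def rss(x, y, degree):
    """Residual sum of squares of a least-squares polynomial fit (hypothetical helper)."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    return float((residuals ** 2).sum())

rss_line = rss(x, y_curved, 1)  # a straight line leaves error behind
rss_quad = rss(x, y_curved, 2)  # a quadratic fits this made-up data almost exactly

print(rss_line, rss_quad)
```

A much smaller error for the quadratic fit suggests the relationship is curved rather than linear.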

You can find an original copy of the data [here](https://www.kaggle.com/carlmcbrideellis/data-anscombes-quartet?select=Anscombe_quartet_data.csv). Don't peek at this until the end.