In order to successfully complete this assignment you must do the required reading, watch the provided videos and complete all instructions.  The embedded survey form must be entirely filled out and submitted on or before **11:59pm on Monday September 21**.  Students must come to class the next day prepared to discuss the material covered in this assignment.

# Pre-Class Assignment: Statistical Models

### Goals for today's pre-class assignment 

1. [Where to start?](#Where_to_start)
2. [Statistics in Numpy](#Statistics_in_Numpy)
3. [Statistics in Pandas](#Statistics_in_Pandas)
4. [Statistics in Scipy](#Statistics_in_SciPy)
5. [Statsmodels](#statsmodels)
6. [ANOVA](#ANOVA)
7. [Assignment wrap-up](#Assignment_wrap-up)

----
<a name="Where_to_start"></a>

# 1. Where to start?

Let's pretend our advisor has asked us to do some statistical analysis of some data.  The question is, where do we start?  First, let's grab some data we can use for testing. Here is a diet dataset I found:

https://bioinformatics-core-shared-training.github.io/linear-models-r/anova.html



In [None]:
from urllib.request import urlretrieve

url = "https://www.sheffield.ac.uk/polopoly_fs/1.570199!/file/stcp-Rdataset-Diet.csv"
file = "stcp-Rdataset-Diet.csv"

urlretrieve(url, file);

In [None]:
#Direct but messy way to read a CSV file
import csv
data = []
header = True
with open(file) as csvDataFile:
    csvReader = csv.reader(csvDataFile)

    for row in csvReader:
        if header:
            names = row
            header = False
        else:          
            nums = []
            for s in row:
                try:
                    nums.append(float(s))
                except ValueError:
                    print(f"Can't append {s}")
                    nums.append(0.0)
            data.append(nums)
print(names)
data

----
<a name="Statistics_in_Numpy"></a>
# 2. Statistics in Numpy 

NumPy has quite a few useful statistical functions for finding the minimum, maximum, percentiles, standard deviation, variance, etc. of the elements in an array.

&#9989; **<font color=red>DO THIS:</font>** Using numpy find the max, min, mean and average for each column in the data.

In [None]:
import numpy as np
data = np.genfromtxt(file, delimiter=',')

#clean the data
data = data[1:]
data[np.isnan(data)] = 0
data

In [None]:
print("max",np.max(data,axis=0))
print("min",np.min(data,axis=0))
print("mean",np.mean(data,axis=0))
print("average",np.average(data,axis=0))

&#9989; **<font color=red>QUESTION:</font>** What other statistical functions does numpy provide? 

Here is a link to a website with all the numpy statistical functions:  
https://numpy.org/doc/stable/reference/routines.statistics.html
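
As a starting point for the question above, here is a minimal sketch of a few other NumPy statistics. The `sample` array is made up for illustration and is not the diet data. The last few lines show the NaN-aware variants, which may be worth considering as an alternative to replacing missing values with 0, since zero-filling shifts the mean.

```python
import numpy as np

# Made-up stand-in data: 5 rows, 2 columns (not the real diet dataset).
sample = np.array([
    [60.0, 22.0],
    [72.0, 46.0],
    [57.0, 55.0],
    [90.0, 33.0],
    [95.0, 50.0],
])

col_median = np.median(sample, axis=0)           # per-column median
col_std = np.std(sample, axis=0, ddof=1)         # sample standard deviation
col_var = np.var(sample, axis=0, ddof=1)         # sample variance
col_q = np.percentile(sample, [25, 75], axis=0)  # 25th and 75th percentiles
corr = np.corrcoef(sample, rowvar=False)         # column-by-column correlation

print("median", col_median)
print("std", col_std)
print("quartiles", col_q)

# NaN-aware functions skip missing values instead of counting them as 0.
with_gap = np.array([60.0, np.nan, 90.0])
nan_aware_mean = np.nanmean(with_gap)
zero_filled_mean = np.mean(np.nan_to_num(with_gap, nan=0.0))
print("nanmean", nan_aware_mean, "zero-filled mean", zero_filled_mean)
```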

----
<a name="Statistics_in_Pandas"></a>
# 3. Statistics in Pandas
The pandas library does a good job with statistics as well. Note we will read the data in again.  In pandas, everything is organized into a DataFrame object. One of my favorite pandas functions is ```describe```.  

In [None]:
%matplotlib inline
import pandas as pd

df = pd.read_csv(file)
df.plot()

In [None]:
df.describe()

&#9989; **<font color=red>QUESTION:</font>** What other statistical functions does pandas provide? 

Here is an article with a lot of pandas statistical functions:  
https://medium.com/swlh/statistical-functions-of-pandas-2862c290053a
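
Beyond ```describe```, a few other pandas statistics can be sketched like this. The `toy` DataFrame and its column names are placeholders I made up, not the columns of the diet file.

```python
import pandas as pd

# Made-up rows; "group", "weight" and "height" are hypothetical column names.
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "c", "c"],
    "weight": [60.0, 62.0, 70.0, 74.0, 80.0, 86.0],
    "height": [160.0, 165.0, 170.0, 172.0, 180.0, 185.0],
})

numeric = toy[["weight", "height"]]
medians = numeric.median()   # per-column median
spread = numeric.std()       # per-column sample standard deviation
corr = numeric.corr()        # pairwise Pearson correlation

# Per-group summaries, which is the shape of question an ANOVA asks about.
by_group = toy.groupby("group")["weight"].agg(["mean", "std", "count"])

print(medians)
print(by_group)
```

The ```groupby``` plus ```agg``` pattern is handy here because it gives the per-group means and spreads in a single table.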

----
<a name="Statistics_in_SciPy"></a>

# 4. Statistics in SciPy

Finally, SciPy has a wide range of statistical functions.  Most of them are located in ```scipy.stats```:

[Link to descriptions of scipy statistical functions](https://docs.scipy.org/doc/scipy/reference/stats.html)

&#9989; **<font color=red>QUESTION:</font>** Find a scipy function or group of functions that look interesting. See if you can figure out how they work.  Come to class prepared to talk about what you learned.

I focused on the frequency statistics section and I thought it was cool you could use multidimensional data. "multidimensional binned statistic"
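
To see what a multidimensional binned statistic does, here is a small sketch using ```scipy.stats.binned_statistic_2d```. The points and values are made up. Each point lands in one cell of a 2x2 grid over the unit square, and the function reports the mean of the values in each cell.

```python
import numpy as np
from scipy import stats

# Made-up points in the unit square, with one value attached to each point.
x = np.array([0.1, 0.2, 0.8, 0.9, 0.3])
y = np.array([0.1, 0.9, 0.2, 0.8, 0.3])
values = np.array([1.0, 2.0, 3.0, 4.0, 3.0])

# Split each axis into 2 bins over [0, 1] and average the values per cell.
result = stats.binned_statistic_2d(
    x, y, values, statistic="mean", bins=2, range=[[0, 1], [0, 1]]
)

cell_means = result.statistic   # rows follow x bins, columns follow y bins
print(cell_means)
print(result.x_edge, result.y_edge)
```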

----
<a name="statsmodels"></a>

# 5. Statsmodels

&#9989; **<font color=red>QUESTION:</font>** Review the Statsmodels documentation and see if you can find anything interesting. Write down notes of what you found so you can share them with the class. 

Really basic looking compared to other stats packages. My comp chem boss would have killed me over something so ugly.

In [None]:
import statsmodels.api as sm

---
<a name="ANOVA"></a>
# 6. ANOVA

Now let us assume that your advisor wants you to run an ANOVA analysis.  

&#9989; **<font color=red>QUESTION:</font>** What is an ANOVA analysis? (either describe in your own words or provide a reference with a description you find and like)

https://www.youtube.com/watch?v=oOuu8IBd-yo  
Tests the difference between two or more groups.  
Analysis of Variance  


&#9989; **<font color=red>QUESTION:</font>** There are typically three assumptions that need to be made in order to use an ANOVA analysis to compare populations. What are those three assumptions?

1. The responses for each factor level are normally distributed.  
2. The distributions have the same variance.  
3. The data are independent.
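
The first two assumptions can be roughly checked in code. The sketch below uses the Shapiro-Wilk test for normality within each group and Levene's test for equal variances, both from ```scipy.stats```. The three small samples are illustrative values typed in by hand, not read from a file. Independence comes from how the data were collected, so no test is shown for it.

```python
from scipy import stats

# Small illustrative samples, one list per factor level.
samples = {
    "placebo": [3, 2, 1, 1, 4],
    "low": [5, 2, 4, 2, 3],
    "high": [7, 4, 5, 3, 6],
}

# Assumption 1: normality within each group (Shapiro-Wilk).
# A small p-value is evidence against normality.
normality_p = {}
for name, values in samples.items():
    statistic, p_value = stats.shapiro(values)
    normality_p[name] = p_value
    print(f"{name}: Shapiro-Wilk p = {p_value:.3f}")

# Assumption 2: equal variances across groups (Levene).
# A small p-value is evidence against equal variances.
levene_stat, levene_p = stats.levene(*samples.values())
print(f"Levene p = {levene_p:.3f}")
```

With only five values per group these tests have little power, so plots of the data are a useful companion to the p-values.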

&#9989; **<font color=red>DO THIS:</font>** Find and test some example code that will run an ANOVA experiment.  Feel free to use any of the above libraries (Or find your own).

I found a site with multiple ways and copied one into this notebook.  
https://www.pythonfordatascience.org/anova-python/

In [None]:
import pandas as pd
import researchpy as rp

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/researchpy/Data-sets/master/difficile.csv")
df.drop('person', axis= 1, inplace= True)

# Recoding value from numeric to string
df['dose'] = df['dose'].replace({1: 'placebo', 2: 'low', 3: 'high'})

df.info()

In [None]:
rp.summary_cont(df['libido'])

In [None]:
rp.summary_cont(df['libido'].groupby(df['dose']))

In [None]:
import scipy.stats as stats

stats.f_oneway(df['libido'][df['dose'] == 'high'],
               df['libido'][df['dose'] == 'low'],
               df['libido'][df['dose'] == 'placebo'])
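
To see what ```f_oneway``` computes, here is a sketch that builds the one-way ANOVA F statistic by hand and compares it with SciPy's result. The three groups are illustrative values typed in directly, so the block runs without downloading anything.

```python
import numpy as np
from scipy import stats

# Illustrative groups, typed in by hand rather than read from the CSV.
groups = [
    np.array([3.0, 2.0, 1.0, 1.0, 4.0]),
    np.array([5.0, 2.0, 4.0, 2.0, 3.0]),
    np.array([7.0, 4.0, 5.0, 3.0, 6.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Variation between the group means and variation inside each group.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F is the ratio of the two mean squares; the p-value is the upper tail of F.
f_manual = (ss_between / df_between) / (ss_within / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual: F = {f_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  F = {f_scipy:.4f}, p = {p_scipy:.4f}")
```

A large F means the group means differ by more than the within-group scatter would suggest, which is the sense in which ANOVA "tests the difference between two or more groups".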

----
<a name="Assignment_wrap-up"></a>
# 7. Assignment wrap-up

Please fill out the form that appears when you run the code below.  **You must completely fill this out in order to receive credit for the assignment!**

[Direct Link to Google Form](https://cmse.msu.edu/cmse802-pc-survey)


If you have trouble with the embedded form, please make sure you log on with your MSU google account at [googleapps.msu.edu](https://googleapps.msu.edu) and then click on the direct link above.

&#9989; **<font color=red>Assignment-Specific QUESTION:</font>** Were you able to find and get an ANOVA example working? If so, what library did you use? If not, where did you get stuck?

Put your answer to the above question here

&#9989; **<font color=red>QUESTION:</font>**  Summarize what you did in this assignment.

Put your answer to the above question here

&#9989; **<font color=red>QUESTION:</font>**  What questions do you have, if any, about any of the topics discussed in this assignment after working through the jupyter notebook?

Put your answer to the above question here

&#9989; **<font color=red>QUESTION:</font>**  How well do you feel this assignment helped you to achieve a better understanding of the above mentioned topic(s)?

Put your answer to the above question here

&#9989; **<font color=red>QUESTION:</font>** What was the **most** challenging part of this assignment for you? 

Put your answer to the above question here

&#9989; **<font color=red>QUESTION:</font>** What was the **least** challenging part of this assignment for you? 

Put your answer to the above question here

&#9989; **<font color=red>QUESTION:</font>**  What kind of additional questions or support, if any, do you feel you need to have a better understanding of the content in this assignment?

Put your answer to the above question here

&#9989; **<font color=red>QUESTION:</font>**  Do you have any further questions or comments about this material, or anything else that's going on in class?

Put your answer to the above question here

&#9989; **<font color=red>QUESTION:</font>** Approximately how long did this pre-class assignment take?

Put your answer to the above question here

In [None]:
from IPython.display import HTML
HTML(
"""
<iframe 
	src="https://cmse.msu.edu/cmse802-pc-survey?embedded=true" 
	width="100%" 
	height="1200px" 
	frameborder="0" 
	marginheight="0" 
	marginwidth="0">
	Loading...
</iframe>
"""
)

---------
### Congratulations, we're done!

To get credit for this assignment you must fill out and submit the above survey form on or before the assignment due date.

### Course Resources:


- [Website](https://msu-cmse-courses.github.io/cmse802-f20-student/)
- [ZOOM](https://msu.zoom.us/j/97272546850)
- [Syllabus](https://docs.google.com/document/d/e/2PACX-1vT9Wn11y0ECI_NAUl_2NA8V5jcD8dXKJkqUSWXjlawgqr2gU5hII3IsE0S8-CPd3W4xsWIlPAg2YW7D/pub)
- [Schedule](https://docs.google.com/spreadsheets/d/e/2PACX-1vQRAm1mqJPQs1YSLPT9_41ABtywSV2f3EWPon9szguL6wvWqWsqaIzqkuHkSk7sea8ZIcIgZmkKJvwu/pubhtml?gid=2142090757&single=true)



Written by Dirk Colbry, Michigan State University
<a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">Creative Commons Attribution-NonCommercial 4.0 International License</a>.