---
# Computing Aggregate Statistics

In [None]:
import pandas as pd

names = ['Bob','Jessica','Mary','John','Mel']
grades = [76,95,77,78,99]

GradeList = list(zip(names,grades))
df = pd.DataFrame(data = GradeList, columns=['Names', 'Grades'])
df

In [None]:
df['Grades'].count() # computes the number of values

df['Grades'].mean() # computes the arithmetic average of the values

df['Grades'].std() # computes the standard deviation of the values

df['Grades'].min() # computes the minimum of the values
df['Grades'].max() # computes the maximum of the values

df['Grades'].quantile(.25) # computes the first quartile of the values
df['Grades'].quantile(.5) # computes the second quartile of the values
df['Grades'].quantile(.75) # computes the third quartile of the values

### Note
If you run all of the previous code in a single cell, you will only see the output of the last line, the final .quantile() call, because a notebook cell displays only the value of its last expression. Run the lines one at a time to see each result. They are grouped together here for reference only.
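
One way to see several statistics from a single cell is to wrap each call in print(), or to call .describe(), which reports the count, mean, standard deviation, minimum, quartiles and maximum together. A minimal sketch, rebuilding the same grades dataframe so that it runs on its own (the variable name summary is introduced here for illustration):

```python
import pandas as pd

names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
grades = [76, 95, 77, 78, 99]
df = pd.DataFrame({'Names': names, 'Grades': grades})

# print() shows every result, not just the last expression in the cell
print(df['Grades'].count())
print(df['Grades'].mean())

# describe() bundles count, mean, std, min, quartiles and max into one Series
summary = df['Grades'].describe()
print(summary)
```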

### Other Measures of Central Tendency

In [None]:
# computes the arithmetic average of the values in a column
# mean = dividing the sum of all values by the number of values
df['Grades'].mean()

# finds the median of the values in a column
# median = the middle value if they are sorted in order
df['Grades'].median()

# finds the mode of the values in a column
# mode = the most common single value
df['Grades'].mode()
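
To make the three definitions concrete, the sketch below recomputes them by hand for the five grades and compares the results with pandas. Note that .mode() returns a Series rather than a single number. Since every grade here occurs exactly once, all five values tie and all are returned. The names manual_mean, manual_median and modes are introduced here for illustration.

```python
import pandas as pd

grades = pd.Series([76, 95, 77, 78, 99], name='Grades')

# mean: sum of the values divided by how many there are (425 / 5)
manual_mean = grades.sum() / grades.count()

# median: the middle value once the grades are sorted
# (this simple indexing only works for an odd number of values)
ordered = sorted(grades)            # 76, 77, 78, 95, 99
manual_median = ordered[len(ordered) // 2]

# mode: most frequent value(s); a Series is returned because ties are possible
modes = grades.mode()

print(manual_mean, grades.mean())
print(manual_median, grades.median())
print(modes.tolist())
```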

In [None]:
df['Grades'].var()

In [None]:
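# numeric_only=True skips the text column; recent pandas versions
# raise an error if asked for the variance of a column of names
df.var(numeric_only=True)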
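
Variance is the square of the standard deviation, and pandas computes the sample variance by default (it divides by n - 1, i.e. ddof=1). A short check on the same grades, with squared_dev and manual_var as illustrative names:

```python
import pandas as pd

grades = pd.Series([76, 95, 77, 78, 99], name='Grades')

# squared deviations from the mean (85): 81, 100, 64, 49, 196 -> total 490
squared_dev = (grades - grades.mean()) ** 2

# sample variance divides by n - 1, matching the pandas default ddof=1
manual_var = squared_dev.sum() / (len(grades) - 1)

print(manual_var)            # 122.5
print(grades.var())          # the pandas result for comparison
print(grades.std() ** 2)     # variance is the standard deviation squared
```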

### Your Turn
Our dataset has only one numeric column. Try creating a dataframe from the following data and computing summary statistics for each of its numeric columns.

In [None]:
names = ['Bob','Jessica','Mary','John','Mel']
grades = [76,95,77,78,99]
bsdegrees = [1,1,0,0,1]
msdegrees = [2,1,0,0,0]
phddegrees = [0,1,0,0,0]

---
# Computing Aggregate Statistics on Matching Rows

In [None]:
import pandas as pd

names = ['Bob','Jessica','Mary','John','Mel']
grades = [76,95,77,78,99]
bsdegrees = [1,1,0,0,1]
msdegrees = [2,1,0,0,0]
phddegrees = [0,1,0,0,0]

GradeList = list(zip(names,grades,bsdegrees,msdegrees,phddegrees))

df = pd.DataFrame(data=GradeList, columns=['Name','Grade','BS','MS','PhD'])
df

In [None]:
df.loc[df['PhD']==0].count()

In [None]:
df.loc[df['PhD']==0, 'Grade'].mean()
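
The two cells above rely on boolean masking: the comparison df['PhD']==0 yields a True/False Series, and .loc keeps only the rows where it is True. A step-by-step sketch using the same data (the variable names mask and no_phd are introduced here for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Bob', 'Jessica', 'Mary', 'John', 'Mel'],
    'Grade': [76, 95, 77, 78, 99],
    'BS': [1, 1, 0, 0, 1],
    'MS': [2, 1, 0, 0, 0],
    'PhD': [0, 1, 0, 0, 0],
})

# True for each row without a PhD, False otherwise
mask = df['PhD'] == 0

# keep only the matching rows and the Grade column
no_phd = df.loc[mask, 'Grade']

print(mask.tolist())    # [True, False, True, True, True]
print(no_phd.mean())    # 82.5
```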

### Your Turn
Using the following data, what is the average grade for people with MS degrees?

In [None]:
import pandas as pd
  
names = ['Bob','Jessica','Mary','John','Mel','Sam','Cathy','Henry','Lloyd']
grades = [76,95,77,78,99,84,79,100,73]
bsdegrees = [1,1,0,0,1,1,1,0,1]
msdegrees = [2,1,0,0,0,1,1,0,0]
phddegrees = [0,1,0,0,0,2,1,0,0]

---
# Sorting Data

In [None]:
import pandas as pd

Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)

df.head()

In [None]:
# ascending=False sorts from oldest to youngest
df = df.sort_values(by='age', ascending=False)
df.head()

In [None]:
df = df.sort_values(by=['grade', 'age'], ascending=[True, True])
df.head()
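
When sorting by several columns, each column can have its own direction, and later columns only break ties in earlier ones. A minimal sketch on a small made-up dataframe (the rows below are hypothetical stand-ins, since the contents of gradedata.csv are not shown here):

```python
import pandas as pd

# Hypothetical rows standing in for gradedata.csv
df = pd.DataFrame({
    'name': ['Ann', 'Ben', 'Cal', 'Dee'],
    'age': [19, 17, 19, 18],
    'grade': [88, 92, 75, 92],
})

# highest grade first; ties on grade are broken by youngest age first
ranked = df.sort_values(by=['grade', 'age'], ascending=[False, True])

print(ranked['name'].tolist())   # ['Ben', 'Dee', 'Ann', 'Cal']
```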

### Your Turn
Can you sort the dataframe to order it by name, age and then grade?

---
# Correlation

In [None]:
import pandas as pd

Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)

df.head()

In [None]:
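# numeric_only=True (pandas 1.5+) skips text columns such as names,
# which cannot be correlated
df.corr(numeric_only=True)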
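
The result of .corr() is a square table of pairwise Pearson correlations, each between -1 and 1. A minimal sketch on made-up values (these columns are hypothetical and not taken from gradedata.csv), chosen so that one pair moves perfectly together and another moves perfectly in opposite directions:

```python
import pandas as pd

# Hypothetical values, not taken from gradedata.csv
df = pd.DataFrame({
    'hours': [1, 2, 3, 4, 5],
    'grade': [52, 54, 56, 58, 60],   # rises in step with hours
    'absences': [5, 4, 3, 2, 1],     # falls in step with hours
    'name': ['a', 'b', 'c', 'd', 'e'],
})

# numeric_only=True leaves the text column out of the table
corr = df.corr(numeric_only=True)
print(corr)

# a value near 1 means the columns rise together; near -1, one falls as the other rises
print(corr.loc['hours', 'grade'])
print(corr.loc['hours', 'absences'])
```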

### Your Turn
Load the data in the following code and find the correlations:

In [None]:
import pandas as pd
 
Location = "datasets/tamiami.csv"

---
# Regression

In [None]:
import pandas as pd

Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)

df.head()

In [None]:
import statsmodels.formula.api as sm
result = sm.ols(formula='grade ~ age + exercise + hours', data=df).fit()
result.summary()

In [None]:
import statsmodels.formula.api as sm
result = sm.ols(formula='grade ~ exercise + hours', data=df).fit()
result.summary()

In [None]:
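import pandas as pd
import statsmodels.formula.api as sm

Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)

df.head()

# "- 1" in the formula removes the intercept term from the model
result = sm.ols(formula='grade ~ age + exercise + hours - 1', data=df).fit()
result.summary()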
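
The "- 1" at the end of the last formula drops the intercept, which forces the fitted line through the origin. The sketch below illustrates the effect with plain NumPy least squares on made-up data rather than with statsmodels; x, y and the other names are hypothetical and chosen only for illustration.

```python
import numpy as np

# Made-up data lying exactly on y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

# With an intercept: the design matrix has a column of ones plus x
X_with = np.column_stack([np.ones_like(x), x])
coef_with, *_ = np.linalg.lstsq(X_with, y, rcond=None)

# Without an intercept (the "- 1" case): the design matrix is x alone
X_without = x.reshape(-1, 1)
coef_without, *_ = np.linalg.lstsq(X_without, y, rcond=None)

print(coef_with)      # approximately intercept 1, slope 2
print(coef_without)   # a steeper slope, since it must absorb the missing intercept
```

Note that statsmodels reports an uncentered R-squared when the model has no intercept, so R-squared values from models with and without an intercept are not directly comparable.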

### Your Turn
Create a new column that converts gender to numeric values, for example 1 for female and 0 for male. Can you now add gender to your regression? Does this improve your R-squared?