# An introduction to descriptive statistics

## Table of contents:
 - [1 Load the dataset](#1-Load-the-dataset)
 - [2 Get the dataset](#2-Get-the-dataset)
 - [3 Find the value counts](#3-Find-the-value-counts)
 - [4 Find the mean of a group](#4-Find-the-mean-of-a-group)
 - [5 Find the maximal value](#5-Find-the-maximal-value)
 - [6 Find the variance](#6-Find-the-variance)

# 1 Load the dataset
<a name="1-Load-the-dataset"></a>

To begin, load the abalone dataset into a pandas DataFrame using the file path `./abalone.csv`. Fill the code below into the empty cell (either type it yourself or copy and paste it) and then run your code. Jupyter displays the value of the last expression in a cell by default, so to see the loaded DataFrame just add the name of the DataFrame as the last expression in the cell.

`import pandas as pd`

`abalone_df = pd.read_csv('./abalone.csv')`

In [1]:
import pandas as pd
abalone_df = pd.read_csv('./abalone.csv')
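
If `./abalone.csv` is not at hand, the same call can be tried on in-memory text. The sketch below is an illustration only: it assumes the file is a comma-separated table with a header row, and `sample_csv` and `sample_df` are hypothetical names holding three rows copied from the `head()` output shown later in this notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for ./abalone.csv: a header row plus three rows
# copied from the head() output shown later in this notebook.
sample_csv = """Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
"""

# read_csv accepts a file path or a file-like object such as StringIO.
sample_df = pd.read_csv(io.StringIO(sample_csv))

# In a notebook, ending the cell with the name displays the DataFrame.
sample_df
```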

# 2 Get the dataset
<a name="2-Get-the-dataset"></a>

You can use pandas methods to generate a helpful range of summary statistics. `pd.DataFrame.describe()` returns the count, mean, standard deviation, minimum, quartiles and maximum of every numeric column. Non-numeric columns such as `Sex` are left out by default. Fill the code below into the empty cell and then run your code. 

`abalone_df.describe()` 

In [2]:
abalone_df.describe()

|  | Length | Diameter | Height | Whole weight | Shucked weight | Viscera weight | Shell weight | Rings |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 4177.0 | 4177.0 | 4177.0 | 4177.0 | 4177.0 | 4177.0 | 4177.0 | 4177.0 |
| mean | 0.523992 | 0.407881 | 0.139516 | 0.828742 | 0.359367 | 0.180594 | 0.238831 | 9.933684 |
| std | 0.120093 | 0.09924 | 0.041827 | 0.490389 | 0.221963 | 0.109614 | 0.139203 | 3.224169 |
| min | 0.075 | 0.055 | 0.0 | 0.002 | 0.001 | 0.0005 | 0.0015 | 1.0 |
| 25% | 0.45 | 0.35 | 0.115 | 0.4415 | 0.186 | 0.0935 | 0.13 | 8.0 |
| 50% | 0.545 | 0.425 | 0.14 | 0.7995 | 0.336 | 0.171 | 0.234 | 9.0 |
| 75% | 0.615 | 0.48 | 0.165 | 1.153 | 0.502 | 0.253 | 0.329 | 11.0 |
| max | 0.815 | 0.65 | 1.13 | 2.8255 | 1.488 | 0.76 | 1.005 | 29.0 |


Calling a method without passing arguments uses the default values of its parameters. Try adding `percentiles=[0.2,0.4,0.6,0.8]` to the method call to see how it changes the output. 

`abalone_df.describe(percentiles=[0.2,0.4,0.6,0.8])` 

In [3]:
abalone_df.describe(percentiles=[0.2, 0.4, 0.6, 0.8])

|  | Length | Diameter | Height | Whole weight | Shucked weight | Viscera weight | Shell weight | Rings |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 4177.0 | 4177.0 | 4177.0 | 4177.0 | 4177.0 | 4177.0 | 4177.0 | 4177.0 |
| mean | 0.523992 | 0.407881 | 0.139516 | 0.828742 | 0.359367 | 0.180594 | 0.238831 | 9.933684 |
| std | 0.120093 | 0.09924 | 0.041827 | 0.490389 | 0.221963 | 0.109614 | 0.139203 | 3.224169 |
| min | 0.075 | 0.055 | 0.0 | 0.002 | 0.001 | 0.0005 | 0.0015 | 1.0 |
| 20% | 0.425 | 0.325 | 0.105 | 0.366 | 0.157 | 0.0765 | 0.1091 | 7.0 |
| 40% | 0.51 | 0.395 | 0.13 | 0.6449 | 0.2745 | 0.1405 | 0.1895 | 9.0 |
| 50% | 0.545 | 0.425 | 0.14 | 0.7995 | 0.336 | 0.171 | 0.234 | 9.0 |
| 60% | 0.575 | 0.45 | 0.15 | 0.93 | 0.4003 | 0.201 | 0.27 | 10.0 |
| 80% | 0.625 | 0.495 | 0.175 | 1.2393 | 0.5424 | 0.273 | 0.3518 | 12.0 |
| max | 0.815 | 0.65 | 1.13 | 2.8255 | 1.488 | 0.76 | 1.005 | 29.0 |
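
To see the effect of the `percentiles` parameter in isolation, the following sketch compares the row labels of the two calls. It is an illustration only: `sample_df` is a hypothetical five-row frame built from the `head()` output shown later in this notebook, so its statistics differ from those of the full dataset.

```python
import pandas as pd

# Hypothetical five-row sample (values copied from the head() output).
sample_df = pd.DataFrame({
    "Height": [0.095, 0.09, 0.135, 0.125, 0.08],
    "Rings": [15, 7, 9, 10, 7],
})

default_summary = sample_df.describe()
custom_summary = sample_df.describe(percentiles=[0.2, 0.8])

# The row labels follow the percentiles that were requested.
print(default_summary.index.tolist())
print(custom_summary.index.tolist())
```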


Another approach is to use `pd.DataFrame.info()`, which prints information about the DataFrame such as the column names, the count of non-null values and the data type of every column. Fill the model code below into the empty cell and then run your code. 

`abalone_df.info()` 

In [4]:
abalone_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Sex             4177 non-null   object 
 1   Length          4177 non-null   float64
 2   Diameter        4177 non-null   float64
 3   Height          4177 non-null   float64
 4   Whole weight    4177 non-null   float64
 5   Shucked weight  4177 non-null   float64
 6   Viscera weight  4177 non-null   float64
 7   Shell weight    4177 non-null   float64
 8   Rings           4177 non-null   int64  
dtypes: float64(7), int64(1), object(1)
memory usage: 293.8+ KB


You can also use the `pd.DataFrame.shape` attribute to get the shape of the DataFrame. The result is a tuple of the form (rows, columns). A tuple is a Python data type that stores an ordered, fixed collection of items. In this case, we have 4177 rows and 9 columns. Fill the model code below into the empty cell and then run your code. 

`abalone_df.shape` 

In [6]:
abalone_df.shape

(4177, 9)
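
Because `shape` is a tuple, its two items can be unpacked into separate names. A minimal sketch, assuming a hypothetical two-column sample (`sample_df`, `n_rows` and `n_cols` are illustrative names, not part of the activity):

```python
import pandas as pd

# Hypothetical sample with five rows and two columns.
sample_df = pd.DataFrame({
    "Sex": ["M", "M", "F", "M", "I"],
    "Rings": [15, 7, 9, 10, 7],
})

# Tuple unpacking: the first item is the row count, the second the column count.
n_rows, n_cols = sample_df.shape
print(f"{n_rows} rows, {n_cols} columns")
```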

Another useful method for exploring the data is [`pd.DataFrame.head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html). It returns the first n rows of the DataFrame, so you can check whether your columns hold the data you expect. Fill the code below into the empty cell and then run your code. What is the default value for n? 

`abalone_df.head()` 

In [7]:
abalone_df.head()

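|  | Sex | Length | Diameter | Height | Whole weight | Shucked weight | Viscera weight | Shell weight | Rings |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |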
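
The number of rows can also be passed explicitly. The sketch below is an illustration on a hypothetical sample (`sample_df` is not part of the activity); it asks for the first two rows and, using the companion method `tail()`, the last two.

```python
import pandas as pd

# Hypothetical sample built from the rows displayed above.
sample_df = pd.DataFrame({
    "Sex": ["M", "M", "F", "M", "I"],
    "Rings": [15, 7, 9, 10, 7],
})

first_two = sample_df.head(2)  # first two rows
last_two = sample_df.tail(2)   # last two rows
print(first_two)
print(last_two)
```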


# 3 Find the value counts
<a name="3-Find-the-value-counts"></a>

In the analysis of the abalone dataset, you may want to find the value counts for a column to help you understand what your data contains. To find the value counts for the `Sex` column in the dataset, fill the model code below into the empty cell and then run your code. 

`abalone_df['Sex'].value_counts()` 

In [8]:
abalone_df['Sex'].value_counts()

M    1528
I    1342
F    1307
Name: Sex, dtype: int64
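
The result of `value_counts()` is itself a Series, sorted from the most to the least frequent value, so it can be inspected further. A minimal sketch on a hypothetical five-value sample (`sex`, `counts`, `most_common` and `total` are illustrative names):

```python
import pandas as pd

# Hypothetical sample of the Sex column (values from the head() output).
sex = pd.Series(["M", "M", "F", "M", "I"], name="Sex")

counts = sex.value_counts()

# Counts are sorted in descending order, so the first label is the most frequent.
most_common = counts.index[0]

# With no missing values, the counts add up to the number of rows.
total = counts.sum()
print(counts)
print(most_common, total)
```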

Often you may want to check the proportions in your data. To do so, set the `normalize` parameter to `True` to see the share of each sex in the abalone dataset. Each proportion is the count divided by the total number of rows, e.g. 1528 / 4177 ≈ 0.365813 for males. 

`abalone_df['Sex'].value_counts(normalize=True)` 

In [9]:
abalone_df['Sex'].value_counts(normalize=True)

M    0.365813
I    0.321283
F    0.312904
Name: Sex, dtype: float64

# 4 Find the mean of a group
<a name="4-Find-the-mean-of-a-group"></a>

We already know the mean of each numeric column in the abalone dataset from running `abalone_df.describe()`. In your analysis you might want to calculate the mean of a column for a specific group, e.g. the mean height of male abalone. To find the mean height of males in the abalone data, we could use this code. Copy the code to the cell below and run it to see the result.  

`abalone_df.loc[abalone_df['Sex'] == 'M','Height'].mean()` 

In [11]:
abalone_df.loc[abalone_df['Sex'] == 'M', 'Height'].mean()

0.15138089005235603

From the output we see that the mean height of male abalone is 0.151381.  

We may also want to calculate the mean for all the groups in the `Sex` column. We can use the [`pd.DataFrame.groupby()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) method for that, allowing us to find out, for example, the height differences between the different sexes in the abalone data. Grouping by `Sex` and taking the mean of `Height` shows that the mean height of infant abalone is smaller than the male and female heights.  
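
The cell for the grouped calculation is not shown here, so the following is only a sketch of one way it could be written, not necessarily the code the activity intended. It runs on a hypothetical five-row sample (`sample_df`, taken from the `head()` output), so its numbers differ from those of the full dataset.

```python
import pandas as pd

# Hypothetical five-row sample (values copied from the head() output).
sample_df = pd.DataFrame({
    "Sex": ["M", "M", "F", "M", "I"],
    "Height": [0.095, 0.09, 0.135, 0.125, 0.08],
})

# Split the rows by Sex, then take the mean Height within each group.
mean_height_by_sex = sample_df.groupby("Sex")["Height"].mean()
print(mean_height_by_sex)
```

For a single group, the grouped result agrees with the boolean-mask approach, e.g. `sample_df.loc[sample_df['Sex'] == 'M', 'Height'].mean()`.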

# 5 Find the maximal value
<a name="5-Find-the-maximal-value"></a>

Another approach you can try during exploration is to find the maximum value in a given column and print the row that it belongs to. For example, you might like to know the heaviest abalone in the dataset. To do so, run the code in the cells below. 

In [12]:
# find the index of the maximum weight 
row_index = abalone_df['Whole weight'].idxmax()
row_index

891

In [13]:
# access the row with the heaviest abalone 
abalone_df.loc[row_index, :]

Sex                    M
Length              0.73
Diameter           0.595
Height              0.23
Whole weight      2.8255
Shucked weight    1.1465
Viscera weight     0.419
Shell weight       0.897
Rings                 17
Name: 891, dtype: object

Note that we have done this in two parts, first finding the index of the maximum weight and then printing the row that it belongs to. 
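
The two parts can also be combined into a single expression. A minimal sketch on a hypothetical five-row sample (`sample_df` and `heaviest` are illustrative names); note that when several rows share the maximum, `idxmax()` returns the label of the first one.

```python
import pandas as pd

# Hypothetical five-row sample (values copied from the head() output).
sample_df = pd.DataFrame({
    "Sex": ["M", "M", "F", "M", "I"],
    "Whole weight": [0.514, 0.2255, 0.677, 0.516, 0.205],
})

# idxmax() gives the index label of the largest value; loc fetches that row.
heaviest = sample_df.loc[sample_df["Whole weight"].idxmax()]
print(heaviest)
```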

# 6 Find the variance
<a name="6-Find-the-variance"></a>

Finally, variance is useful for understanding how far the values are spread from their mean value. As an example, we can find the variance of the shell weight in the abalone dataset. Fill the code below into the empty cell and then run your code. 

`abalone_df['Shell weight'].var()` 

In [14]:
abalone_df['Shell weight'].var()

0.019377383202158645
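
By default, `var()` computes the sample variance, which divides the sum of squared deviations by n - 1 (`ddof=1`). The sketch below checks this by hand on a hypothetical five-value sample (`shell_weight` holds values from the `head()` output, so the result differs from the full-dataset figure above).

```python
import pandas as pd

# Hypothetical sample of the Shell weight column.
shell_weight = pd.Series([0.15, 0.07, 0.21, 0.155, 0.055], name="Shell weight")

sample_var = shell_weight.var()            # default ddof=1: divide by n - 1
population_var = shell_weight.var(ddof=0)  # divide by n instead

# Manual sample variance: squared deviations from the mean, divided by n - 1.
deviations = shell_weight - shell_weight.mean()
manual_var = (deviations ** 2).sum() / (len(shell_weight) - 1)

print(sample_var, manual_var, population_var)
```

The variance is also the square of the standard deviation, which is why the `std` row of `describe()` and the output of `var()` describe the same spread on different scales.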

<div class="warning" style='padding:0.1em; background-color:#e6ffff'>
<span>
<p style='margin:1em;'>
<b>Congratulations!</b></p>
<p style='margin:1em;'>
You’ve now completed this activity, which will be useful to the data exploration and analysis you will complete in your assignment work. To revise this activity, return to the Canvas page and read over the content under “What you’ll learn”.
</p>
<p style='margin-bottom:1em; margin-right:1em; text-align:right; font-family:Georgia'>
</p></span>
</div>