## Introduction

In this lesson, you'll learn how to use some of the key summary statistics methods in Pandas.

## Objectives:

You will be able to:

- Understand and use the `df.describe()` and `df.info()` summary statistics methods
- Use built-in Pandas methods for calculating summary statistics 
- Apply a function to every element in a Series or DataFrame using `s.apply()` and `df.applymap()`


## Getting DataFrame-Level Summary Statistics

When working with a new dataset, the first step is to understand what it contains. The Pandas DataFrame class provides two built-in methods, `df.info()` and `df.describe()`, that make this easy for us.


In [1]:
import pandas as pd

In [None]:
# titanic = pd.read_csv("https://gist.githubusercontent.com/michhar/2dfd2de0d4f8727f873422c5d959fff5/raw/ff414a1bcfcba32481e4d4e8db578e55872a2ca1/titanic.csv",
#                            sep='\t')   
# titanic.head()

In [5]:
titanic = pd.read_csv("../data/titanic.csv")
titanic.head(3)

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


### Using `df.info()`

The `df.info()` method provides us with summary **_metadata_** about our DataFrame -- that is, it gives us data about our dataset, such as how many rows and columns it contains, and what data types they are stored as.  Let's demonstrate this by reading in the Titanic dataset and calling the `.info()` method on the DataFrame. 

In [7]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   891 non-null    int64  
 1   PassengerId  891 non-null    int64  
 2   Survived     891 non-null    int64  
 3   Pclass       891 non-null    object 
 4   Name         891 non-null    object 
 5   Sex          891 non-null    object 
 6   Age          714 non-null    float64
 7   SibSp        891 non-null    int64  
 8   Parch        891 non-null    int64  
 9   Ticket       891 non-null    object 
 10  Fare         891 non-null    float64
 11  Cabin        204 non-null    object 
 12  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(6)
memory usage: 90.6+ KB


As we can see from the output above, the `.info()` method provides us with great information about the characteristics of the DataFrame, without telling us anything about the data it actually contains. 

Examine the output above, and take note of the important things it tells us about the DataFrame, such as:

* The number of columns and rows in the DataFrame
* The data type of the data each column contains
* How many values each column contains (NaNs are not counted)
* The memory footprint of the DataFrame

This sort of information about a dataset is called **_metadata_**, since it's data about our data. 
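Each piece of metadata listed above can also be retrieved on its own. The sketch below uses a small made-up DataFrame (the `toy` data is an assumption for illustration, not part of the Titanic file) and shows one way to capture the `.info()` report as text instead of printing it.

```python
import io
import pandas as pd

# Toy data invented for illustration; it is not taken from the Titanic file.
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Dee"],
    "age": [22.0, None, 35.0, 41.0],
    "fare": [7.25, 71.28, 8.05, 53.10],
})

n_rows, n_cols = toy.shape   # number of rows and columns
dtypes = toy.dtypes          # data type of each column
non_null = toy.count()       # non-null values per column, as reported by .info()
missing = toy.isna().sum()   # missing values per column

# .info() prints its report; passing buf= captures it as a string instead.
buffer = io.StringIO()
toy.info(buf=buffer)
info_text = buffer.getvalue()

print(n_rows, n_cols)
print(missing)
```

Comparing `non_null` with the number of rows is a quick way to spot columns with missing values.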


### Using `df.describe()`

The next step in Exploratory Data Analysis (EDA) is usually to dig into the summary statistics of the dataset, and get a feel for the data each column contains.  Rather than force us to deal with the tedium of doing this individually for every column, Pandas DataFrames provide the handy `df.describe()` method, which by default calculates the basic summary statistics for each numeric column automatically.

See the example in the cell below.

In [8]:
titanic.describe()

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,445.0,446.0,0.383838,29.699118,0.523008,0.381594,32.204208
std,257.353842,257.353842,0.486592,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.0,0.42,0.0,0.0,0.0
25%,222.5,223.5,0.0,20.125,0.0,0.0,7.9104
50%,445.0,446.0,0.0,28.0,0.0,0.0,14.4542
75%,667.5,668.5,1.0,38.0,1.0,0.0,31.0
max,890.0,891.0,1.0,80.0,8.0,6.0,512.3292


As we can see, the output of the `.describe()` method is very handy, and gives us relevant information such as:

* A `count` of the non-null values in each column, making it easy to identify columns with missing values
* The mean and standard deviation of each column
* The minimum and maximum values found in each column
* The median (50%) and quartile values (25% & 75%) for each column

Use the `.describe()` method to quickly help you get a feel for your datasets when you start the Exploratory Data Analysis process. 
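As a minimal sketch of how `.describe()` can be adjusted, the example below uses a small made-up DataFrame (the `toy` data is an assumption for illustration). It shows the default numeric summary, a custom choice of percentiles, and the different summary that text columns receive.

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, None],
    "sex": ["male", "female", "female", "female", "male"],
})

# By default only numeric columns are summarised; missing values are ignored.
numeric_summary = toy.describe()

# The percentiles argument controls which percentiles are reported.
custom = toy.describe(percentiles=[0.1, 0.9])

# Text columns get a different summary: count, unique, top and freq.
text_summary = toy[["sex"]].describe()

# include="all" summarises every column in one table.
everything = toy.describe(include="all")

print(numeric_summary)
print(text_summary)
```

For a text column, `top` is the most frequent value and `freq` is how often it occurs.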


## Calculating Individual Column Statistics


If we need to calculate individual statistics about a column, we can also do this easily.  Pandas DataFrames and Series objects come with a plethora of built-in methods to instantly calculate summary statistics for us. 

See the code block below for examples:

In [9]:
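# We can see the mean value of every numeric column.
# numeric_only=True skips the text columns; recent pandas versions raise a TypeError without it.
titanic.mean(numeric_only=True)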

Unnamed: 0     445.000000
PassengerId    446.000000
Survived         0.383838
Age             29.699118
SibSp            0.523008
Parch            0.381594
Fare            32.204208
dtype: float64

# Find the mean for the fare column

In [13]:
titanic.Fare.mean().round(3) # rounding it to 3 decimal places

32.204

# Find the 90th percentile (0.9 quantile) of the Age column

In [14]:
titanic.Age.quantile(.9)

50.0

# Find the median value for the Age column

In [15]:
titanic.Age.median()

28.0

There are many different statistical methods built into Pandas DataFrames -- these are just a few! We will not list all of them, but here are some common ones you'll probably make use of early and often:

* `.mode()` -- the mode of the column
* `.count()` -- the number of non-null entries in a column
* `.std()` -- the standard deviation for the column. It is a measure of the spread of the data, and is the square root of the variance.
* `.var()` -- the variance for the column
* `.sum()` -- the sum of all values in the column
* `.cumsum()` -- the cumulative sum, where each position contains the sum of all values up to and including that position
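The methods listed above can be tried on a small Series. This is a sketch on made-up numbers (the Series `s` is an assumption for illustration), chosen so the results are easy to check by hand.

```python
import pandas as pd

# Toy values invented for illustration; one entry is missing on purpose.
s = pd.Series([2.0, 4.0, 4.0, None, 10.0])

n_values = s.count()     # non-null entries only
total = s.sum()          # missing values are skipped
most_common = s.mode()   # returns a Series, since there can be ties
variance = s.var()       # sample variance (divides by n - 1)
spread = s.std()         # square root of the variance
running = s.cumsum()     # running total; the missing position stays NaN

print(n_values, total, variance)
print(running)
```

Note that `.mode()` returns a Series rather than a single value, because more than one value can share the highest frequency.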


### Summary Statistics for Categorical Columns

Obviously, we cannot calculate most summary statistics on columns that contain non-numeric data -- there's no way for us to find the mean of the letters in the `Embarked` column, for instance.  However, there are some summary statistics we can use to help us better understand our categorical columns. 

See the examples in the cell below:

In [16]:
titanic["Embarked"].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [17]:
titanic["Embarked"].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

# These methods are extremely useful when dealing with categorical data! 

`.unique()` shows us all the unique values contained in the column. 

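`.value_counts()` shows us how many times each unique value appears in the column, giving us a feel for the distribution of values. By default, missing values (`NaN`) are not counted, which is why `nan` appears in the `.unique()` output above but not in the counts.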
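A few optional arguments make these methods more informative. The sketch below uses a short made-up Series standing in for a column like `Embarked` (the values are an assumption for illustration).

```python
import pandas as pd

# Toy values invented for illustration; one entry is missing on purpose.
embarked = pd.Series(["S", "C", "S", None, "Q", "S"], name="Embarked")

n_unique = embarked.nunique()                        # distinct non-missing values
counts = embarked.value_counts()                     # missing values are left out
with_missing = embarked.value_counts(dropna=False)   # count missing values too
shares = embarked.value_counts(normalize=True)       # proportions instead of counts

print(n_unique)
print(with_missing)
print(shares)
```

Using `normalize=True` is handy when comparing columns or datasets of different sizes.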

# Remember lambda functions? Well, they're back

![lambda](https://media.giphy.com/media/3o7TKJ9WK7JRk9QVos/giphy.gif)

### Calculating on the Fly with `.apply()` and `.applymap()`

Sometimes, we'll need to make changes to our dataset, or to compute functions on our data that aren't built into Pandas.  We can do this by passing lambda functions into the `.apply()` method when working with Pandas Series, and the `.applymap()` method when working with Pandas DataFrames. In pandas 2.1 and later, `.applymap()` is deprecated in favor of the equivalent `DataFrame.map()`.

Note that neither of these methods mutates the original dataset -- instead, each returns a new Series or DataFrame containing the result.

See the example in the cell below:

In [18]:
# Quick function to convert every value in the DataFrame to a string. This is not advisable though
string_df = titanic.applymap(lambda x: str(x))
string_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Unnamed: 0   891 non-null    object
 1   PassengerId  891 non-null    object
 2   Survived     891 non-null    object
 3   Pclass       891 non-null    object
 4   Name         891 non-null    object
 5   Sex          891 non-null    object
 6   Age          891 non-null    object
 7   SibSp        891 non-null    object
 8   Parch        891 non-null    object
 9   Ticket       891 non-null    object
 10  Fare         891 non-null    object
 11  Cabin        891 non-null    object
 12  Embarked     891 non-null    object
dtypes: object(13)
memory usage: 90.6+ KB


# Let's quickly square every value in the Age column

In [23]:
print("\nThe lambda function applies to only this part\n", '-'*60, sep='')

display(titanic['Age'].apply(lambda x: x**2).head())


print("\nNote that the original data in the age column has not changed\n","-"*60, sep="") 
titanic['Age'].head()


The lambda function applies to only this part
------------------------------------------------------------


0     484.0
1    1444.0
2     676.0
3    1225.0
4    1225.0
Name: Age, dtype: float64


Note that the original data in the age column has not changed
------------------------------------------------------------


0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64
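The same element-wise pattern works with named functions, which are often easier to read than a long lambda. The sketch below uses made-up data, and the helper `age_group` is a hypothetical function written for this example. It also picks `DataFrame.map()` when the installed pandas version provides it, and falls back to `.applymap()` otherwise.

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({"age": [22.0, 38.0, 26.0], "fare": [7.25, 71.2833, 7.925]})


def age_group(age):
    """Hypothetical helper: label an age as 'adult' or 'under 30'."""
    return "adult" if age >= 30 else "under 30"


# .apply() on a Series calls the function once per value.
groups = toy["age"].apply(age_group)

# Element-wise function over a whole DataFrame: use DataFrame.map() where it
# exists (pandas 2.1+), otherwise fall back to the older .applymap().
elementwise = toy.map if hasattr(toy, "map") else toy.applymap
doubled = elementwise(lambda x: x * 2)

# For plain arithmetic, operating on the column directly is simpler and faster.
squared = toy["age"] ** 2

print(groups)
print(doubled)
```

None of these calls changes `toy`; each returns a new object, so the result has to be assigned if we want to keep it.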

In [None]:
# We can also perform various types of investigations from this data

In [26]:
import numpy as np

In [24]:
# Collect the names of the columns stored as text (dtype "object"), then summarise them
non_numeric = titanic.dtypes[titanic.dtypes == "object"].index

titanic[non_numeric].describe()

Unnamed: 0,Pclass,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,891,204,889
unique,4,891,2,681,147,3
top,3,"Heikkinen, Miss. Laina",male,CA. 2343,G6,S
freq,469,1,577,7,4,644


# Let's list the names of the non-numeric columns


In [28]:
titanic.dtypes[titanic.dtypes == "object"].index

Index(['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], dtype='object')
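Pandas also offers `select_dtypes()` for splitting a DataFrame by column type, which avoids building the boolean filter by hand. This is a sketch on a small made-up DataFrame (the `toy` data is an assumption for illustration).

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({
    "name": ["Ann", "Bob"],
    "sex": ["female", "male"],
    "age": [22.0, 38.0],
    "sibsp": [1, 0],
})

# "number" matches every numeric dtype (integers and floats).
numeric = toy.select_dtypes(include="number")
non_numeric = toy.select_dtypes(exclude="number")

print(list(numeric.columns))
print(list(non_numeric.columns))
```

The result is a DataFrame, so it can be passed straight to `.describe()`.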

# Let's take a look at the Name, Sex and Age columns only

In [29]:
titanic[["Name","Sex","Age"]].head()

Unnamed: 0,Name,Sex,Age
0,"Braund, Mr. Owen Harris",male,22.0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0
2,"Heikkinen, Miss. Laina",female,26.0
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0
4,"Allen, Mr. William Henry",male,35.0


# Check a few of the alphabetically sorted names (every second name from positions 5 through 9)

In [30]:
sorted(titanic["Name"])[5:10:2]

['Adahl, Mr. Mauritz Nils Martin',
 'Ahlin, Mrs. Johan (Johanna Persdotter Larsson)',
 'Albimona, Mr. Nassef Cassem']

# Let's take a look at the descriptive statistics for the Name column

In [31]:
titanic["Name"].describe()

count                        891
unique                       891
top       Heikkinen, Miss. Laina
freq                           1
Name: Name, dtype: object

# Check unique cabins

In [35]:
titanic["Cabin"].unique()  

array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
       'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
       'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
       'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
       'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
       'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
       'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
       'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
       'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
       'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
       'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
       'C62 C64',

In [41]:
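# Convert the data to str; missing values (NaN) become the string "nan"
char_cabin = titanic["Cabin"].astype(str)

# Keep only the first letter of each cabin value; missing cabins end up as "n"
new_Cabin = [cabin[0] for cabin in char_cabin]

# Store the letters as a categorical variable
new_Cabin = pd.Categorical(new_Cabin)

new_Cabin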

[n, C, n, C, n, ..., n, B, n, C, n]
Length: 891
Categories (9, object): [A, B, C, D, ..., F, G, T, n]

In [42]:
# Overwrite the existing Cabin column with the cabin letters
titanic["Cabin"] = new_Cabin

In [43]:
titanic.head()

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,n,S
1,1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,C
2,2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,n,S
3,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C,S
4,4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,n,S
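The first-letter extraction above can also be written with the vectorised `.str` accessor, which keeps missing cabins as missing so they can be given an explicit label. This is a sketch on a short made-up Series; the label `"Unknown"` is an assumption chosen for illustration, not something the lesson's dataset uses.

```python
import pandas as pd

# Toy values invented for illustration; two cabins are missing on purpose.
cabin = pd.Series(["C85", None, "E46", "B57 B59", None], name="Cabin")

# .str[0] takes the first character of each string and leaves missing values as NaN.
first_letter = cabin.str[0]

# Give missing cabins an explicit label, then store the result as a categorical.
deck = pd.Categorical(first_letter.fillna("Unknown"))

print(deck)
```

An explicit label is easier to interpret later than the `n` left over from the string `"nan"`.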


In [46]:
# Exploring the Fare column
np.where(titanic["Fare"]==max(titanic["Fare"]))

(array([258, 679, 737]),)

# Let's find those who paid the maximum fares on the ship

In [47]:
titanic.iloc[np.where(titanic["Fare"]==max(titanic["Fare"]))]

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
258,258,259,1,1,"Ward, Miss. Anna",female,35.0,0,0,PC 17755,512.3292,n,C
679,679,680,1,1,"Cardeza, Mr. Thomas Drake Martinez",male,36.0,0,1,PC 17755,512.3292,B,C
737,737,738,1,1,"Lesurer, Mr. Gustave J",male,35.0,0,0,PC 17755,512.3292,B,C
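The same rows can be selected without NumPy by using a boolean mask. The sketch below uses made-up data (the `toy` DataFrame is an assumption for illustration) and also shows `idxmax()` and `nlargest()`, two related built-in methods.

```python
import pandas as pd

# Toy data invented for illustration; two passengers share the highest fare.
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Dee"],
    "fare": [7.25, 512.33, 8.05, 512.33],
})

# Boolean mask: keep every row whose fare equals the maximum fare.
top_fare = toy[toy["fare"] == toy["fare"].max()]

# idxmax() returns the index label of the first maximum only.
first_label = toy["fare"].idxmax()

# nlargest() returns the n rows with the largest values in a column.
top_two = toy.nlargest(2, "fare")

print(top_fare)
print(first_label)
```

The boolean mask returns all tied rows, whereas `idxmax()` reports only the first one.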


In [49]:
# Let's create a new column holding each passenger's family size (siblings/spouses plus parents/children)
titanic["Family"] = titanic["SibSp"] + titanic["Parch"]
titanic["Family"]

0      1
1      1
2      0
3      1
4      0
      ..
886    0
887    0
888    3
889    0
890    0
Name: Family, Length: 891, dtype: int64

In [50]:
# we can see a new column has been created for family
titanic.head()

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Family
0,0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,n,S,1
1,1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,C,1
2,2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,n,S,0
3,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C,S,1
4,4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,n,S,0


# Find the passengers with the largest families on the ship

In [51]:
most_family = np.where(titanic["Family"] == max(titanic["Family"]))
most_family

(array([159, 180, 201, 324, 792, 846, 863]),)

In [53]:
# Recompute family size, then show the rows for the passengers with the largest families
titanic["Family"] = titanic["SibSp"] + titanic["Parch"]

most_family = np.where(titanic["Family"] == max(titanic["Family"]))
titanic.iloc[most_family]

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Family
159,159,160,0,3,"Sage, Master. Thomas Henry",male,,8,2,CA. 2343,69.55,n,S,10
180,180,181,0,?,"Sage, Miss. Constance Gladys",female,,8,2,CA. 2343,69.55,n,S,10
201,201,202,0,3,"Sage, Mr. Frederick",male,,8,2,CA. 2343,69.55,n,S,10
324,324,325,0,3,"Sage, Mr. George John Jr",male,,8,2,CA. 2343,69.55,n,S,10
792,792,793,0,3,"Sage, Miss. Stella Anna",female,,8,2,CA. 2343,69.55,n,S,10
846,846,847,0,3,"Sage, Mr. Douglas Bullen",male,,8,2,CA. 2343,69.55,n,S,10
863,863,864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.55,n,S,10
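One row in the output above shows `?` in the `Pclass` column, and a placeholder like that is a plausible reason why `.info()` reported `Pclass` as `object` rather than a numeric type. As a minimal sketch, assuming such placeholders should be treated as missing values, `pd.to_numeric()` can convert the column. The Series below is made up for illustration.

```python
import pandas as pd

# Toy values invented for illustration; "?" stands in for an unknown class.
pclass = pd.Series(["3", "1", "?", "3"], name="Pclass")

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
cleaned = pd.to_numeric(pclass, errors="coerce")

print(cleaned)
print(cleaned.isna().sum())
```

After the conversion, numeric methods such as `.mean()` and `.describe()` treat the column like any other numeric column.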


## Summary

In this lesson, you learned how to:

* Understand and use the `df.describe()` and `df.info()` summary statistics methods 
* Use built-in Pandas methods for calculating summary statistics 
* Apply a function to every element in a Series or DataFrame using `s.apply()` and `df.applymap()` 

![high5](https://media.giphy.com/media/W0Vz7zDK7vgMBc2fpw/giphy.gif)