# Viewing and Understanding DataFrames using Pandas

After reading tabular data into a DataFrame, you will usually want a quick look at it. You can view a small sample of the rows or summarize the data with summary statistics.

## Viewing Data using `.head()` and `.tail()`

You can view the first or last few rows of a DataFrame using the `.head()` or `.tail()` methods, respectively. The `n` argument sets the number of rows to return (the default is 5).

```python
df.head()
```
This will show the first five rows of the DataFrame.

```python
df.tail(n=10)
```
This will show the last ten rows of the DataFrame.
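
To make the behavior of `n` concrete, here is a minimal sketch on a small, made-up DataFrame (the name `toy` and its columns are assumptions for illustration, not part of the diabetes dataset). It also shows that a negative `n` returns everything except the last `|n|` rows.

```python
import pandas as pd

# Hypothetical toy data: 12 rows, so the default of 5 rows is visible.
toy = pd.DataFrame({"id": range(12), "value": [x * 1.5 for x in range(12)]})

first_default = toy.head()        # first 5 rows (default n=5)
last_three = toy.tail(3)          # last 3 rows; n can be passed positionally
all_but_last_two = toy.head(-2)   # negative n: every row except the last 2

print(len(first_default), len(last_three), len(all_but_last_two))
```

Both methods return a new DataFrame, so the results can be stored or chained with other methods.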

In [22]:
import pandas as pd
df = pd.read_csv("diabetes.csv")
df.head(n=10)
# Show the first ten rows of the DataFrame

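|  | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | NaN | 35 | NaN | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | NaN | 29 | 0.0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64.0 | 0 | 0.0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66.0 | 23 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40.0 | 35 | 168.0 | 43.1 | 2.288 | 33 | 1 |
| 5 | 5 | 116 | 74.0 | 0 | 0.0 | 25.6 | 0.201 | 30 | 0 |
| 6 | 3 | 78 | 50.0 | 32 | NaN | 31.0 | 0.248 | 26 | 1 |
| 7 | 10 | 115 | 0.0 | 0 | 0.0 | 35.3 | 0.134 | 29 | 0 |
| 8 | 2 | 197 | 70.0 | 45 | 543.0 | 30.5 | 0.158 | 53 | 1 |
| 9 | 8 | 125 | 96.0 | 0 | 0.0 | 0.0 | 0.232 | 54 | 1 |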


In [2]:
df.tail(n=10)
# Show the last ten rows of the DataFrame

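|  | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 758 | 1 | 106 | 76 | 0 | 0 | 37.5 | 0.197 | 26 | 0 |
| 759 | 6 | 190 | 92 | 0 | 0 | 35.5 | 0.278 | 66 | 1 |
| 760 | 2 | 88 | 58 | 26 | 16 | 28.4 | 0.766 | 22 | 0 |
| 761 | 9 | 170 | 74 | 31 | 0 | 44.0 | 0.403 | 43 | 1 |
| 762 | 9 | 89 | 62 | 0 | 0 | 22.5 | 0.142 | 33 | 0 |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.34 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |
| 767 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.315 | 23 | 0 |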


## Understanding Data using `.describe()`

The `.describe()` method returns summary statistics for the numeric columns: count, mean, standard deviation, minimum, maximum, and quartiles.

```python
df.describe()
```
This gives a quick look at the scale, skew, and range of numeric data.

You can also change which percentiles are reported using the `percentiles` argument.
```python
df.describe(percentiles=[0.3, 0.5, 0.7])
```
This example reports the 30th, 50th, and 70th percentiles of the numeric columns in DataFrame `df`.

To summarize only integer columns, use the `include` argument.
```python
df.describe(include=[int])
```
Similarly, you might want to exclude certain data types using the `exclude` argument.
```python
df.describe(exclude=[int])
```
Practitioners often find these statistics easier to read after transposing them with the `.T` attribute.
```python
df.describe().T
```
This will transpose the summary statistics.
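
The options above can be tried on a small, made-up DataFrame. The sketch below is illustrative only: the name `toy` and its columns are assumptions. It also uses `include="all"`, which asks `.describe()` to summarize non-numeric columns alongside numeric ones.

```python
import pandas as pd

# Hypothetical toy data with two numeric columns and one text column.
toy = pd.DataFrame({
    "age": [21, 35, 50, 42],
    "bmi": [22.5, 31.0, 28.4, 35.1],
    "group": ["a", "b", "a", "c"],
})

numeric_summary = toy.describe()                           # numeric columns only
custom = toy.describe(percentiles=[0.1, 0.5, 0.9])         # custom percentiles
full_summary = toy.describe(include="all")                 # text column included
transposed = toy.describe().T                              # one row per column

print(custom.index.tolist())
print(full_summary.columns.tolist())
```

In the transposed result, each row corresponds to a column of the original DataFrame and each column to a statistic, which tends to be easier to scan when there are many columns.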

In [4]:
df.describe()
# Get summary statistics with .describe()

|  | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [4]:
df.describe(percentiles=[0.3, 0.5, 0.7])
# Get summary statistics with specific percentiles

|  | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 30% | 1.0 | 102.0 | 64.0 | 8.2 | 0.0 | 28.2 | 0.259 | 25.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 70% | 5.0 | 134.0 | 78.0 | 31.0 | 106.0 | 35.49 | 0.5637 | 38.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [5]:
df.describe(include=[int])
# Get summary statistics of integer columns only

|  | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | Age | Outcome |
|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 81.0 | 1.0 |


In [6]:
df.describe(exclude=[int])
# Get summary statistics of non-integer columns only

|  | BMI | DiabetesPedigreeFunction |
|---|---|---|
| count | 768.0 | 768.0 |
| mean | 31.992578 | 0.471876 |
| std | 7.88416 | 0.331329 |
| min | 0.0 | 0.078 |
| 25% | 27.3 | 0.24375 |
| 50% | 32.0 | 0.3725 |
| 75% | 36.6 | 0.62625 |
| max | 67.1 | 2.42 |


In [7]:
df.describe().T
# Transpose summary statistics with .T

|  | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Pregnancies | 768.0 | 3.845052 | 3.369578 | 0.0 | 1.0 | 3.0 | 6.0 | 17.0 |
| Glucose | 768.0 | 120.894531 | 31.972618 | 0.0 | 99.0 | 117.0 | 140.25 | 199.0 |
| BloodPressure | 768.0 | 69.105469 | 19.355807 | 0.0 | 62.0 | 72.0 | 80.0 | 122.0 |
| SkinThickness | 768.0 | 20.536458 | 15.952218 | 0.0 | 0.0 | 23.0 | 32.0 | 99.0 |
| Insulin | 768.0 | 79.799479 | 115.244002 | 0.0 | 0.0 | 30.5 | 127.25 | 846.0 |
| BMI | 768.0 | 31.992578 | 7.88416 | 0.0 | 27.3 | 32.0 | 36.6 | 67.1 |
| DiabetesPedigreeFunction | 768.0 | 0.471876 | 0.331329 | 0.078 | 0.24375 | 0.3725 | 0.62625 | 2.42 |
| Age | 768.0 | 33.240885 | 11.760232 | 21.0 | 24.0 | 29.0 | 41.0 | 81.0 |
| Outcome | 768.0 | 0.348958 | 0.476951 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


## Understanding Data using `.info()`

The `.info()` method is a quick way to look at the data types, non-null counts, and memory footprint of a DataFrame. Here, `show_counts=True` displays the number of non-missing values in each column, and `memory_usage=True` reports the memory used by the DataFrame. Setting `verbose=True` prints the full per-column summary rather than a shortened one.

```python
df.info(show_counts=True, memory_usage=True, verbose=True)
```
This will give a detailed overview of the DataFrame.
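
Note that `.info()` prints its report and returns `None`. If you want the report as a string (for logging, say), one option is to pass a text buffer through the `buf` argument. The following is a minimal sketch on made-up data; `toy` and its columns are assumptions for illustration. It also passes `memory_usage="deep"`, which measures the actual memory held by text columns rather than an estimate.

```python
import io

import pandas as pd

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "name": ["ann", "bob", None],
    "score": [1.5, None, 3.0],
})

buffer = io.StringIO()
result = toy.info(buf=buffer, memory_usage="deep")  # writes to buffer, returns None
report = buffer.getvalue()

print(report)
```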

In [7]:
df.info(show_counts=True, memory_usage=True, verbose=True)
# Get a quick overview of the DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             767 non-null    float64
 4   Insulin                   768 non-null    int64  
 5   BMI                       767 non-null    float64
 6   DiabetesPedigreeFunction  767 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(3), int64(6)
memory usage: 54.1 KB


## Understanding Data using `.shape`

The number of rows and columns of a DataFrame is available through its `.shape` attribute. It returns a tuple `(rows, columns)`, which can be indexed to get only the row count or only the column count.

```python
df.shape
```
This will return the number of rows and columns.

```python
df.shape[0]
```
This will return the number of rows only.

```python
df.shape[1]
```
This will return the number of columns only.
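
Because `.shape` is an ordinary tuple, it can also be unpacked in one step. The sketch below uses a made-up DataFrame (`toy` is an assumption for illustration) and compares `.shape` with two related ways of measuring a DataFrame: `len()`, which gives the row count, and `.size`, which gives the total number of cells.

```python
import pandas as pd

# Hypothetical toy data: 3 rows, 2 columns.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

n_rows, n_cols = toy.shape   # tuple unpacking
row_count = len(toy)         # same as toy.shape[0]
cell_count = toy.size        # rows * columns

print(n_rows, n_cols, row_count, cell_count)
```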

In [8]:
df.shape
# Get the number of rows and columns

(768, 9)

In [9]:
df.shape[0]
# Get the number of rows only

768

In [10]:
df.shape[1]
# Get the number of columns only

9

## Getting All Columns and Column Names

The `.columns` attribute of a DataFrame returns the column names as an `Index` object. As a reminder, a pandas index holds the labels used to address rows or columns.

```python
df.columns
```
This will return the column names.

It can be converted to a list using the `list()` function.
```python
list(df.columns)
```
This will return the column names as a list.
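
A few common things to do with `.columns` are sketched below on a made-up DataFrame (`toy` is an assumption for illustration; its column names are borrowed from the dataset only for readability). `Index.tolist()` is an alternative to `list()`, and the `in` operator checks whether a column exists.

```python
import pandas as pd

# Hypothetical toy data reusing three of the dataset's column names.
toy = pd.DataFrame({"Glucose": [148, 85], "BMI": [33.6, 26.6], "Outcome": [1, 0]})

names = toy.columns.tolist()          # same result as list(toy.columns)
has_bmi = "BMI" in toy.columns        # membership test on the Index
features = [c for c in toy.columns if c != "Outcome"]  # every column but the target

print(names, has_bmi, features)
```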

In [12]:
df.columns
# Get all columns and column names

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [13]:
list(df.columns)
# Convert column names to a list

['Pregnancies',
 'Glucose',
 'BloodPressure',
 'SkinThickness',
 'Insulin',
 'BMI',
 'DiabetesPedigreeFunction',
 'Age',
 'Outcome']

## Checking for Missing Values in Pandas with `.isnull()`

You can check whether each element in a DataFrame is missing using the `.isnull()` method.

```python
df.isnull().head(7)
```
This will return the first seven rows showing where missing values occur.

Because it is often more useful to know how much data is missing, you can combine `.isnull()` with `.sum()` to count the number of nulls in each column.

```python
df.isnull().sum()
```
This will return the number of missing values in each column.

You can also chain a second `.sum()` to get the total number of nulls in the DataFrame.

```python
df.isnull().sum().sum()
```
This will return the total number of missing values in the DataFrame.
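
The same pattern extends to a couple of related questions: what share of each column is missing, and which rows contain at least one missing value. The sketch below uses made-up data (`toy` and its columns are assumptions for illustration) and `.isna()`, which is an alias of `.isnull()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing values.
toy = pd.DataFrame({
    "bp": [72.0, np.nan, 64.0, np.nan],
    "insulin": [0.0, 94.0, np.nan, 168.0],
    "age": [50, 31, 32, 21],
})

missing_per_column = toy.isna().sum()            # count of nulls per column
missing_share = toy.isna().mean()                # fraction of nulls per column
rows_with_missing = toy[toy.isna().any(axis=1)]  # rows with at least one null
total_missing = int(toy.isna().sum().sum())      # total nulls in the DataFrame

print(missing_per_column)
print(missing_share)
print(rows_with_missing.index.tolist(), total_missing)
```

Taking the mean works because `True` counts as 1 and `False` as 0, so the average of the boolean mask is the fraction of missing entries.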

In [23]:
df.isnull().head(7)
# Check for missing values in the first seven rows

|  | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | True | False | True | False | False | False | False |
| 1 | False | False | True | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False |
| 5 | False | False | False | False | False | False | False | False | False |
| 6 | False | False | False | False | True | False | False | False | False |


In [24]:
df.isnull()
# Check every element of the DataFrame for missing values

|  | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | True | False | True | False | False | False | False |
| 1 | False | False | True | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | False | False | False | False | False | False | False | False | False |
| 764 | False | False | False | False | False | False | False | False | False |
| 765 | False | False | False | False | False | False | False | False | False |
| 766 | False | False | False | False | False | False | False | False | False |


In [25]:
df.isnull().sum()
# Count the number of nulls in each column

Pregnancies                 0
Glucose                     0
BloodPressure               5
SkinThickness               0
Insulin                     2
BMI                         0
DiabetesPedigreeFunction    1
Age                         0
Outcome                     0
dtype: int64

In [26]:
df.isnull().sum().sum()
# Get the total number of nulls in the DataFrame

8