# Inspecting a DataFrame Object

## About the Data
In this notebook, we will be working with earthquake data from August 19, 2021 through September 17, 2021 (UTC), obtained from the US Geological Survey (USGS) using the [USGS API](https://earthquake.usgs.gov/fdsnws/event/1/).

## Setup
We will be working with the `earthquakes.csv` file again, so we need to handle our imports and read it in.

In [4]:
import numpy as np
import pandas as pd

df = pd.read_csv('earthquakes.csv')

## Examining dataframes
### Is it empty?

In [11]:
df.empty

False

### What are the dimensions?

In [6]:
df.shape

(13894, 26)
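
To see how these attributes relate to one another, here is a minimal sketch on a small made-up frame. The `toy` and `no_rows` names and their values are assumptions for illustration only, not part of the earthquake data:

```python
import pandas as pd

# Hypothetical three-row frame standing in for the earthquake data
toy = pd.DataFrame({
    'mag': [1.1, 4.2, 0.8],
    'place': ['CA', 'Turkey', 'Montana'],
})

n_rows, n_cols = toy.shape   # shape is a (rows, columns) tuple
print(n_rows, n_cols)        # 3 2
print(len(toy))              # number of rows: 3
print(toy.size)              # rows * columns: 6
print(toy.empty)             # False, since there is at least one row

# Slicing away every row keeps the columns but makes the frame empty
no_rows = toy.iloc[0:0]
print(no_rows.shape, no_rows.empty)
```

Note that `empty` only looks at the axes: a frame whose cells are all `NaN` still has rows, so it is not considered empty.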

### What columns do we have?
We know there are 26 columns, but what are they? Let's use the `columns` attribute to see:

In [7]:
df.columns

Index(['mag', 'place', 'time', 'updated', 'tz', 'url', 'detail', 'felt', 'cdi',
       'mmi', 'alert', 'status', 'tsunami', 'sig', 'net', 'code', 'ids',
       'sources', 'types', 'nst', 'dmin', 'rms', 'gap', 'magType', 'type',
       'title'],
      dtype='object')
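
Since `columns` is an `Index` object, we can also query it directly. The following is a small sketch on a hypothetical frame (the `toy` frame and the `wanted` list are made up for illustration) showing a few common checks:

```python
import pandas as pd

# Hypothetical frame with a few of the earthquake column names
toy = pd.DataFrame({
    'mag': [1.1, 4.2],
    'place': ['CA', 'Turkey'],
    'tsunami': [0, 0],
})

print(toy.columns.tolist())          # column names as a plain Python list
print('mag' in toy.columns)          # membership test on the Index
print(toy.columns.get_loc('place'))  # integer position of a column

# Check which of the columns we expect are not present
wanted = ['mag', 'depth']
missing = [col for col in wanted if col not in toy.columns]
print(missing)
```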

### What does the data look like?
View rows from the top with `head()`:

In [12]:
df.head()

| | mag | place | time | updated | tz | url | detail | felt | cdi | mmi | ... | ids | sources | types | nst | dmin | rms | gap | magType | type | title |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -0.23 | 15 km WSW of Dutch Harbor, Alaska | 1631922462560 | 1631937139320 | NaN | https://earthquake.usgs.gov/earthquakes/eventp... | https://earthquake.usgs.gov/fdsnws/event/1/que... | NaN | NaN | NaN | ... | ,av91041373, | ,av, | ,origin,phase-data, | 6.0 | NaN | 0.27 | 100.0 | ml | earthquake | M -0.2 - 15 km WSW of Dutch Harbor, Alaska |
| 1 | 4.2 | 20 km NNE of Honaz, Turkey | 1631922330700 | 1631937449264 | NaN | https://earthquake.usgs.gov/earthquakes/eventp... | https://earthquake.usgs.gov/fdsnws/event/1/que... | 2.0 | 3.4 | NaN | ... | ,us7000fca0, | ,us, | ,dyfi,origin,phase-data, | NaN | 0.846 | 0.76 | 37.0 | mb | earthquake | M 4.2 - 20 km NNE of Honaz, Turkey |
| 2 | 1.08 | 1km ESE of Warner Springs, CA | 1631922235240 | 1631923421910 | NaN | https://earthquake.usgs.gov/earthquakes/eventp... | https://earthquake.usgs.gov/fdsnws/event/1/que... | NaN | NaN | NaN | ... | ,ci39812199, | ,ci, | ,focal-mechanism,nearby-cities,origin,phase-da... | 51.0 | 0.08135 | 0.15 | 35.0 | ml | earthquake | M 1.1 - 1km ESE of Warner Springs, CA |
| 3 | 3.1 | 8 km NW of Harding-Birch Lakes, Alaska | 1631921599431 | 1632001703819 | NaN | https://earthquake.usgs.gov/earthquakes/eventp... | https://earthquake.usgs.gov/fdsnws/event/1/que... | 5.0 | 3.4 | NaN | ... | ,ak021bydmbf5,us7000fc9z, | ,ak,us, | ,dyfi,origin,phase-data, | NaN | NaN | 0.66 | NaN | ml | earthquake | M 3.1 - 8 km NW of Harding-Birch Lakes, Alaska |
| 4 | 0.84 | 15 km SE of Lincoln, Montana | 1631921499620 | 1631975710640 | NaN | https://earthquake.usgs.gov/earthquakes/eventp... | https://earthquake.usgs.gov/fdsnws/event/1/que... | NaN | NaN | NaN | ... | ,mb80523504, | ,mb, | ,origin,phase-data, | 13.0 | 0.043 | 0.16 | 60.0 | ml | earthquake | M 0.8 - 15 km SE of Lincoln, Montana |


View rows from the bottom with `tail()`. Let's view 2 rows:

In [13]:
df.tail(2)

| | mag | place | time | updated | tz | url | detail | felt | cdi | mmi | ... | ids | sources | types | nst | dmin | rms | gap | magType | type | title |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 13892 | 0.91 | 6km SSE of Mentone, CA | 1629331747910 | 1629332491684 | NaN | https://earthquake.usgs.gov/earthquakes/eventp... | https://earthquake.usgs.gov/fdsnws/event/1/que... | NaN | NaN | NaN | ... | ,ci40014696, | ,ci, | ,nearby-cities,origin,phase-data,scitech-link, | 18.0 | 0.08895 | 0.18 | 104.0 | ml | earthquake | M 0.9 - 6km SSE of Mentone, CA |
| 13893 | 1.4 | 49 km WSW of Cantwell, Alaska | 1629331456759 | 1630703456944 | NaN | https://earthquake.usgs.gov/earthquakes/eventp... | https://earthquake.usgs.gov/fdsnws/event/1/que... | NaN | NaN | NaN | ... | ,ak021am1629k, | ,ak, | ,origin,phase-data, | NaN | NaN | 0.43 | NaN | ml | earthquake | M 1.4 - 49 km WSW of Cantwell, Alaska |
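
Both `head()` and `tail()` accept the number of rows to return, and `sample()` is handy when the top and bottom rows are not representative. A minimal sketch, assuming a made-up single-column frame named `toy`:

```python
import pandas as pd

# Hypothetical frame with seven rows
toy = pd.DataFrame({'mag': [0.5, 1.2, 3.3, 2.1, 4.0, 0.9, 1.7]})

first_three = toy.head(3)    # rows 0, 1, 2
last_two = toy.tail(2)       # rows 5, 6

# A negative n returns everything except the last |n| rows
all_but_last_two = toy.head(-2)

# Random rows; random_state makes the draw repeatable
random_rows = toy.sample(n=3, random_state=0)

print(first_three)
print(last_two)
print(len(all_but_last_two), len(random_rows))
```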


*Tip: we can modify the display options in order to see more columns:*

```python
# check the max columns setting
>>> pd.get_option('display.max_columns')
20

# set the max columns to show when printing the dataframe to 26
>>> pd.set_option('display.max_columns', 26)
# OR
>>> pd.options.display.max_columns = 26

# reset the option
>>> pd.reset_option('display.max_columns')

# get information on all display settings
>>> pd.describe_option('display')
```

*More information can be found in the [pandas options and settings documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html).*
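
If we only need the wider display for a single cell, `pd.option_context()` applies an option temporarily and restores the previous value when the block exits. A minimal sketch (the variable names are ours, chosen for illustration):

```python
import pandas as pd

# Remember whatever the current setting is
before = pd.get_option('display.max_columns')

with pd.option_context('display.max_columns', 26):
    # Inside the block, dataframes print with up to 26 columns
    inside = pd.get_option('display.max_columns')

# Outside the block, the earlier setting is back in effect
after = pd.get_option('display.max_columns')

print(inside)
print(after == before)
```

This avoids having to remember to call `pd.reset_option()` afterwards.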

### What data types do we have?

In [14]:
df.dtypes

mag        float64
place       object
time         int64
updated      int64
tz         float64
url         object
detail      object
felt       float64
cdi        float64
mmi        float64
alert       object
status      object
tsunami      int64
sig          int64
net         object
code        object
ids         object
sources     object
types       object
nst        float64
dmin       float64
rms        float64
gap        float64
magType     object
type        object
title       object
dtype: object
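
The `time` and `updated` columns are stored as `int64` because the USGS API reports them as milliseconds since the Unix epoch. The sketch below assumes a tiny hypothetical frame (`toy`) built from two rows of the `head()` output above; it groups columns by type with `select_dtypes()` and converts the timestamps with `pd.to_datetime()`:

```python
import pandas as pd

# Hypothetical frame reusing two rows shown in the head() output
toy = pd.DataFrame({
    'mag': [-0.23, 4.2],
    'place': ['15 km WSW of Dutch Harbor, Alaska', '20 km NNE of Honaz, Turkey'],
    'time': [1631922462560, 1631922330700],
})

# Split the column names by broad data type
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
other_cols = toy.select_dtypes(exclude='number').columns.tolist()
print(numeric_cols, other_cols)

# Interpret the integers as milliseconds since the epoch (UTC)
timestamps = pd.to_datetime(toy['time'], unit='ms')
print(timestamps)
```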

### Getting extra info and finding nulls

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13894 entries, 0 to 13893
Data columns (total 26 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   mag      13891 non-null  float64
 1   place    13862 non-null  object 
 2   time     13894 non-null  int64  
 3   updated  13894 non-null  int64  
 4   tz       0 non-null      float64
 5   url      13894 non-null  object 
 6   detail   13894 non-null  object 
 7   felt     692 non-null    float64
 8   cdi      692 non-null    float64
 9   mmi      155 non-null    float64
 10  alert    67 non-null     object 
 11  status   13894 non-null  object 
 12  tsunami  13894 non-null  int64  
 13  sig      13894 non-null  int64  
 14  net      13894 non-null  object 
 15  code     13894 non-null  object 
 16  ids      13894 non-null  object 
 17  sources  13894 non-null  object 
 18  types    13894 non-null  object 
 19  nst      9571 non-null   float64
 20  dmin     8432 non-null   float64
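 21  rms      13888 non-null  float64
 22  gap      10788 non-null  float64
 23  magType  13891 non-null  object
 24  type     13894 non-null  object
 25  title    13894 non-null  object
dtypes: float64(9), int64(4), object(13)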
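
The non-null counts from `info()` tell us where values are missing, but it is often easier to count the nulls directly. A minimal sketch, assuming a hypothetical frame `toy` whose `tz` column is entirely null (as it is in the earthquake data):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with some missing values
toy = pd.DataFrame({
    'mag': [1.1, np.nan, 0.8, 2.4],
    'alert': [np.nan, 'green', np.nan, np.nan],
    'tsunami': [0, 0, 1, 0],
    'tz': [np.nan, np.nan, np.nan, np.nan],
})

# isna() gives a Boolean frame; summing counts the True values per column
null_counts = toy.isna().sum()
print(null_counts)

# The mean of a Boolean column is the fraction of nulls
null_share = toy.isna().mean()
print(null_share)

# Columns with no data at all
all_null_cols = null_counts[null_counts == len(toy)].index.tolist()
print(all_null_cols)
```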

## Describing and Summarizing
### Get summary statistics

In [16]:
df.describe()

| | mag | time | updated | tz | felt | cdi | mmi | tsunami | sig | nst | dmin | rms | gap |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 13891.0 | 13894.0 | 13894.0 | 0.0 | 692.0 | 692.0 | 155.0 | 13894.0 | 13894.0 | 9571.0 | 8432.0 | 13888.0 | 10788.0 |
| mean | 1.558236 | 1630470000000.0 | 1630826000000.0 | NaN | 14.982659 | 2.541329 | 3.251877 | 0.000648 | 57.909385 | 20.056525 | 0.765241 | 0.279889 | 112.432183 |
| std | 1.147013 | 719798600.0 | 736215600.0 | NaN | 105.561402 | 1.322223 | 1.425208 | 0.025444 | 89.999596 | 14.76786 | 2.505021 | 0.271166 | 59.469183 |
| min | -1.23 | 1629331000000.0 | 1629332000000.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 12.0 |
| 25% | 0.83 | 1629852000000.0 | 1630266000000.0 | NaN | 1.0 | 2.0 | 2.4695 | 0.0 | 11.0 | 10.0 | 0.027 | 0.09 | 68.0 |
| 50% | 1.36 | 1630373000000.0 | 1630811000000.0 | NaN | 1.0 | 2.2 | 3.31 | 0.0 | 28.0 | 15.0 | 0.06238 | 0.16 | 97.0 |
| 75% | 1.96 | 1631036000000.0 | 1631489000000.0 | NaN | 4.0 | 3.4 | 3.936 | 0.0 | 59.0 | 25.0 | 0.151 | 0.42 | 144.1825 |
| max | 7.1 | 1631922000000.0 | 1632070000000.0 | NaN | 2259.0 | 8.3 | 7.746 | 1.0 | 2520.0 | 148.0 | 45.477 | 1.87 | 359.0 |


Specifying the 5<sup>th</sup> and 95<sup>th</sup> percentiles (the output below also includes the median as `50%`):

In [17]:
df.describe(percentiles=[0.05, 0.95])

| | mag | time | updated | tz | felt | cdi | mmi | tsunami | sig | nst | dmin | rms | gap |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 13891.0 | 13894.0 | 13894.0 | 0.0 | 692.0 | 692.0 | 155.0 | 13894.0 | 13894.0 | 9571.0 | 8432.0 | 13888.0 | 10788.0 |
| mean | 1.558236 | 1630470000000.0 | 1630826000000.0 | NaN | 14.982659 | 2.541329 | 3.251877 | 0.000648 | 57.909385 | 20.056525 | 0.765241 | 0.279889 | 112.432183 |
| std | 1.147013 | 719798600.0 | 736215600.0 | NaN | 105.561402 | 1.322223 | 1.425208 | 0.025444 | 89.999596 | 14.76786 | 2.505021 | 0.271166 | 59.469183 |
| min | -1.23 | 1629331000000.0 | 1629332000000.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 12.0 |
| 5% | 0.18 | 1629441000000.0 | 1629642000000.0 | NaN | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 5.0 | 0.007459 | 0.03 | 44.0 |
| 50% | 1.36 | 1630373000000.0 | 1630811000000.0 | NaN | 1.0 | 2.2 | 3.31 | 0.0 | 28.0 | 15.0 | 0.06238 | 0.16 | 97.0 |
| 95% | 4.4 | 1631740000000.0 | 1631903000000.0 | NaN | 27.45 | 4.8 | 5.7202 | 0.0 | 298.0 | 48.0 | 6.05205 | 0.84 | 231.0 |
| max | 7.1 | 1631922000000.0 | 1632070000000.0 | NaN | 2259.0 | 8.3 | 7.746 | 1.0 | 2520.0 | 148.0 | 45.477 | 1.87 | 359.0 |
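
The percentile rows reported by `describe()` can be cross-checked with `quantile()`, which by default interpolates linearly between neighboring values. A small sketch on a made-up `Series` (the `mags` values are hypothetical and chosen so the arithmetic is easy to follow):

```python
import pandas as pd

# Hypothetical magnitudes, evenly spaced to keep the math simple
mags = pd.Series([0.5, 1.0, 1.5, 2.0, 2.5], name='mag')

# Summary with custom percentiles
summary = mags.describe(percentiles=[0.05, 0.95])
print(summary)

# The same percentiles computed directly
q = mags.quantile([0.05, 0.95])
print(q)

# With 5 values, the 5th percentile sits 0.2 of the way from 0.5 to 1.0
# and the 95th sits 0.8 of the way from 2.0 to 2.5
```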


Describe specific data types. Here we pass the built-in `object` type, since the `np.object` alias was deprecated in NumPy 1.20 and removed in NumPy 1.24:

In [18]:
df.describe(include=object)


| | place | url | detail | alert | status | net | code | ids | sources | types | magType | type | title |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 13862 | 13894 | 13894 | 67 | 13894 | 13894 | 13894 | 13894 | 13894 | 13894 | 13891 | 13894 | 13894 |
| unique | 6457 | 13894 | 13894 | 3 | 2 | 15 | 13891 | 13894 | 73 | 71 | 9 | 7 | 10507 |
| top | 7 km SW of Volcano, Hawaii | https://earthquake.usgs.gov/earthquakes/eventp... | https://earthquake.usgs.gov/fdsnws/event/1/que... | green | reviewed | ak | 2021qjsh | ,av91041373, | ,ak, | ,origin,phase-data, | ml | earthquake | M 0.7 - 7 km SW of Volcano, Hawaii |
| freq | 557 | 1 | 1 | 65 | 10961 | 3100 | 2 | 1 | 2785 | 8212 | 8647 | 13518 | 46 |
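
The `include` argument has a counterpart, `exclude`, and both accept a list of types. As a minimal sketch on a hypothetical frame (`toy` and its values are made up), we can summarize the numeric and non-numeric columns separately:

```python
import numpy as np
import pandas as pd

# Hypothetical frame mixing numeric and text columns
toy = pd.DataFrame({
    'mag': [1.1, 4.2, 0.8],
    'magType': ['ml', 'mb', 'ml'],
    'tsunami': [0, 0, 1],
})

# Only numeric columns: count, mean, std, min, quartiles, max
numeric_summary = toy.describe(include=[np.number])
print(numeric_summary)

# Everything except numeric columns: count, unique, top, freq
text_summary = toy.describe(exclude=[np.number])
print(text_summary)
```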


Or describe all of them:

In [19]:
df.describe(include='all')

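| | mag | place | time | updated | tz | url | detail | felt | cdi | mmi | ... | ids | sources | types | nst | dmin | rms | gap | magType | type | title |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 13891.0 | 13862 | 13894.0 | 13894.0 | 0.0 | 13894 | 13894 | 692.0 | 692.0 | 155.0 | ... | 13894 | 13894 | 13894 | 9571.0 | 8432.0 | 13888.0 | 10788.0 | 13891 | 13894 | 13894 |
| unique | NaN | 6457 | NaN | NaN | NaN | 13894 | 13894 | NaN | NaN | NaN | ... | 13894 | 73 | 71 | NaN | NaN | NaN | NaN | 9 | 7 | 10507 |
| top | NaN | 7 km SW of Volcano, Hawaii | NaN | NaN | NaN | https://earthquake.usgs.gov/earthquakes/eventp... | https://earthquake.usgs.gov/fdsnws/event/1/que... | NaN | NaN | NaN | ... | ,av91041373, | ,ak, | ,origin,phase-data, | NaN | NaN | NaN | NaN | ml | earthquake | M 0.7 - 7 km SW of Volcano, Hawaii |
| freq | NaN | 557 | NaN | NaN | NaN | 1 | 1 | NaN | NaN | NaN | ... | 1 | 2785 | 8212 | NaN | NaN | NaN | NaN | 8647 | 13518 | 46 |
| mean | 1.558236 | NaN | 1630470000000.0 | 1630826000000.0 | NaN | NaN | NaN | 14.982659 | 2.541329 | 3.251877 | ... | NaN | NaN | NaN | 20.056525 | 0.765241 | 0.279889 | 112.432183 | NaN | NaN | NaN |
| std | 1.147013 | NaN | 719798600.0 | 736215600.0 | NaN | NaN | NaN | 105.561402 | 1.322223 | 1.425208 | ... | NaN | NaN | NaN | 14.76786 | 2.505021 | 0.271166 | 59.469183 | NaN | NaN | NaN |
| min | -1.23 | NaN | 1629331000000.0 | 1629332000000.0 | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 12.0 | NaN | NaN | NaN |
| 25% | 0.83 | NaN | 1629852000000.0 | 1630266000000.0 | NaN | NaN | NaN | 1.0 | 2.0 | 2.4695 | ... | NaN | NaN | NaN | 10.0 | 0.027 | 0.09 | 68.0 | NaN | NaN | NaN |
| 50% | 1.36 | NaN | 1630373000000.0 | 1630811000000.0 | NaN | NaN | NaN | 1.0 | 2.2 | 3.31 | ... | NaN | NaN | NaN | 15.0 | 0.06238 | 0.16 | 97.0 | NaN | NaN | NaN |
| 75% | 1.96 | NaN | 1631036000000.0 | 1631489000000.0 | NaN | NaN | NaN | 4.0 | 3.4 | 3.936 | ... | NaN | NaN | NaN | 25.0 | 0.151 | 0.42 | 144.1825 | NaN | NaN | NaN |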


`describe()` also works on a single column (a `Series`):

In [20]:
df.felt.describe()

count     692.000000
mean       14.982659
std       105.561402
min         0.000000
25%         1.000000
50%         1.000000
75%         4.000000
max      2259.000000
Name: felt, dtype: float64

There are methods for specific statistics as well. Here is a sampling of them:

| Method | Description | Data types |
| --- | --- | --- |
| `count()` | The number of non-null observations | Any |
| `nunique()` | The number of unique values | Any |
| `sum()` | The total of the values | Numerical or Boolean |
| `mean()` | The average of the values | Numerical or Boolean |
| `median()` | The median of the values | Numerical |
| `min()` | The minimum of the values | Numerical |
| `idxmin()` | The index where the minimum value occurs | Numerical |
| `max()` | The maximum of the values | Numerical |
| `idxmax()` | The index where the maximum value occurs | Numerical |
| `abs()` | The absolute values of the data | Numerical |
| `std()` | The standard deviation | Numerical |
| `var()` | The variance |  Numerical |
| `cov()` | The covariance between two `Series`, or a covariance matrix for all column combinations in a `DataFrame` | Numerical |
| `corr()` | The correlation between two `Series`, or a correlation matrix for all column combinations in a `DataFrame` | Numerical |
| `quantile()` | Calculates a specific quantile | Numerical |
| `cumsum()` | The cumulative sum | Numerical or Boolean |
| `cummin()` | The cumulative minimum | Numerical |
| `cummax()` | The cumulative maximum | Numerical |
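
To make a few of these concrete, here is a short sketch on a made-up `Series` (the `mags` values and the `is_strong` name are hypothetical). It also shows why `sum()` and `mean()` are useful on Boolean data: they give the count and the fraction of `True` values:

```python
import pandas as pd

# Hypothetical magnitudes; note the repeated maximum of 4.2
mags = pd.Series([1.5, 4.2, 0.8, 4.2, 2.0], name='mag')

print(mags.count())            # 5 non-null observations
print(mags.nunique())          # 4 distinct values
print(mags.idxmax())           # 1, the label of the first maximum
print(mags.idxmin())           # 2
print(mags.median())           # 2.0
print(mags.cummax().tolist())  # [1.5, 4.2, 4.2, 4.2, 4.2]

# Boolean data: how many, and what share, exceed magnitude 4?
is_strong = mags > 4
print(is_strong.sum())         # 2
print(is_strong.mean())        # 0.4
```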

For example, we can find the unique values in the `alert` column with `unique()`:

In [21]:
df.alert.unique()

array([nan, 'green', 'orange', 'red'], dtype=object)

We can then use `value_counts()` to see how many of each unique value we have:

In [22]:
df.alert.value_counts()

green     65
orange     1
red        1
Name: alert, dtype: int64
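
By default, `value_counts()` leaves out nulls, which is why `nan` appears in the `unique()` result but not in the counts. A minimal sketch with a made-up `alert` series (values are hypothetical) shows how to include nulls or get proportions instead:

```python
import numpy as np
import pandas as pd

# Hypothetical alert levels with some missing values
alert = pd.Series(['green', np.nan, 'green', 'red', np.nan, 'green'], name='alert')

# Default: nulls are dropped
counts = alert.value_counts()
print(counts)

# Keep the nulls as their own category
with_nulls = alert.value_counts(dropna=False)
print(with_nulls)

# Proportions of the non-null values instead of raw counts
shares = alert.value_counts(normalize=True)
print(shares)
```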

Note that `Index` objects also have several methods to help describe and summarize our data:

| Method | Description |
| --- | --- |
| `argmax()`/`argmin()` | Find the location of the maximum/minimum value in the index |
| `equals()` | Compare the index to another `Index` object for equality |
| `isin()` | Check if the index values are in a list of values and return an array of Booleans |
| `max()`/`min()` | Find the maximum/minimum value in the index |
| `nunique()` | Get the number of unique values in the index |
| `to_series()` | Create a `Series` object from the index |
| `unique()` | Find the unique values of the index |
| `value_counts()`| Create a frequency table for the unique values in the index |
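
As a quick illustration of the methods in this table, the sketch below uses a small made-up `Index` (the `idx` and `same` names and values are hypothetical):

```python
import pandas as pd

# Hypothetical index with a repeated value
idx = pd.Index([3, 1, 4, 1, 5])

print(idx.argmax(), idx.argmin())   # positions of the max and (first) min: 4 1
print(idx.max(), idx.min())         # 5 1
print(idx.nunique())                # 4
print(idx.unique().tolist())        # [3, 1, 4, 5]
print(idx.isin([1, 5]).tolist())    # [False, True, False, True, True]

# equals() compares the values element by element
same = pd.Index([3, 1, 4, 1, 5])
print(idx.equals(same))             # True

# Frequency table and Series conversion
counts = idx.value_counts()
as_series = idx.to_series()
print(counts)
print(as_series)
```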
