# Importing and inspecting data
If you want to type along with me, use [this notebook](https://humboldt.cloudbank.2i2c.cloud/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fbethanyj0%2Fdata271_sp25&branch=main&urlpath=tree%2Fdata271_sp25%2Flectures%2Fdata271_lec13_live.ipynb) instead. 
If you would rather follow along by just executing the cells, stay in this notebook.

In [None]:
import numpy as np
import pandas as pd

## Import data

In [None]:
# read a csv in your working directory
df = pd.read_csv('earthquakes.csv')
df.head(2)

In [None]:
# read a csv online
df = pd.read_csv('https://github.com/bethanyj0/data271_sp24/blob/main/demos/earthquakes.csv?raw=True')
df.head(2)

In [None]:
# read in an excel file
df = pd.read_excel('earthquakes.xlsx',sheet_name = 'earthquakes')
df.head(2)

In [None]:
# Read in an excel file online
df = pd.read_excel('https://github.com/bethanyj0/data271_sp24/blob/main/demos/earthquakes.xlsx?raw=True',sheet_name = 'earthquakes')
df.head(2)

In [None]:
# To use one of the columns as your row index
df = pd.read_csv('earthquakes.csv',index_col='code')
df.head(2)

In [None]:
df.index.name
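
Here is a small sketch of what `index_col` does, using an in-memory CSV so it runs without the course file. The three rows and their values are made up for illustration; only the column names `code`, `mag`, and `place` come from the dataset above.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for earthquakes.csv (values are made up)
csv_text = """code,mag,place
a1,1.2,"10km N of Somewhere, CA"
b2,5.1,"Offshore Oregon"
c3,3.3,"Central California"
"""

# index_col turns the 'code' column into the row labels
sample = pd.read_csv(io.StringIO(csv_text), index_col='code')

print(sample.index.name)        # code
print(sample.loc['b2', 'mag'])  # 5.1
```

Once `code` is the index, it no longer shows up in `sample.columns`, and rows can be looked up by label with `.loc`.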

### Initial Inspection of the Data

In [None]:
# Is the data frame empty? Did import fail?
df.empty

In [None]:
# display the top few rows
df.head(3)

In [None]:
# inspecting the last three rows
df.tail(3)

In [None]:
# info() gives more information, including the number of non-nulls
df.info()

## Pandas Methods

In [None]:
# To reset the index
df.reset_index().head(2)

In [None]:
# Doesn't update the original
df.head(2)

In [None]:
# Reset the index in the original
df.reset_index(inplace=True)
df.head(2)
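
The three cells above can be condensed into one toy example. This is only a sketch on a made-up two-row frame (the name `toy` and its values are not from the dataset); it also shows reassignment as an alternative to `inplace=True`.

```python
import pandas as pd

# Hypothetical toy frame indexed by 'code' (values are made up)
toy = pd.DataFrame({'mag': [1.2, 5.1]},
                   index=pd.Index(['a1', 'b2'], name='code'))

# Without inplace=True, reset_index returns a new frame and leaves toy alone
returned = toy.reset_index()
print(list(returned.columns))  # ['code', 'mag']
print(toy.index.name)          # code

# Reassigning the result has the same effect as inplace=True
toy = toy.reset_index()
print(list(toy.columns))       # ['code', 'mag']
```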

In [None]:
# obtain summary statistics for numeric columns
df.describe()

In [None]:
# if we would like to just describe one column, such as mag (magnitude)
df.mag.describe()

In [None]:
# we can look for unique values in a column
df.status.unique()

In [None]:
# Get the number of rows in each category
df.status.value_counts()

In [None]:
# Get the number of non-null values in a column
df.felt.count()

In [None]:
# mean of a column
df.mag.mean()

In [None]:
# median
df.mag.median()

In [None]:
# quantile
df.mag.quantile(0.5)
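
The 0.5 quantile is the median, and `quantile` also accepts a list of values. Here is a quick check on a made-up Series (the numbers are not from the dataset), assuming the default linear interpolation:

```python
import pandas as pd

# Hypothetical magnitudes, chosen so the arithmetic is easy to follow
mags = pd.Series([1.0, 2.0, 3.0, 4.0])

# The 0.5 quantile and the median agree
print(mags.quantile(0.5), mags.median())     # 2.5 2.5

# Passing a list returns one value per requested quantile
print(mags.quantile([0.25, 0.75]).tolist())  # [1.75, 3.25]
```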

In [None]:
# sum of a column
df.mag.sum()

In [None]:
# min of a column
df.mag.min()

In [None]:
# max of a column
df.mag.max()

In [None]:
# POSITION of maximum (can also use min)
df.mag.argmax()

In [None]:
# INDEX LABEL of maximum (can also use min)
df.mag.idxmax()
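
With the default integer index (which `df` has after `reset_index`), `argmax` and `idxmax` return the same number, which hides the difference. A sketch with made-up string labels makes it visible; the Series `mags` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical magnitudes with string labels instead of 0, 1, 2
mags = pd.Series([2.5, 6.9, 4.0], index=['a1', 'b2', 'c3'], name='mag')

pos = mags.argmax()    # integer position of the maximum
label = mags.idxmax()  # index label of the maximum

print(pos, label)      # 1 b2

# iloc takes positions, loc takes labels; both reach the same value
assert mags.iloc[pos] == mags.loc[label] == 6.9
```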

In [None]:
# Sort values in a series
df.mag.sort_values()

In [None]:
# Sort the rows of a dataframe by the values in one column
df.sort_values(by='mag')

In [None]:
# Some numeric methods may raise an error on a dataframe that mixes numeric and non-numeric columns
#df.max()

In [None]:
# You can do multiple columns at once if all numeric
df.loc[:,['mag','gap']].max()
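
Another option, instead of listing the numeric columns by hand, is the `numeric_only` argument. This is a sketch on a made-up frame (the name `toy` and its values are hypothetical), not the lecture's own approach.

```python
import pandas as pd

# Hypothetical frame mixing numeric and text columns
toy = pd.DataFrame({
    'mag': [1.2, 5.1, 3.3],
    'gap': [40.0, 12.0, 95.0],
    'place': ['Oregon', 'California', 'Alaska'],
})

# numeric_only=True skips non-numeric columns such as 'place'
maxes = toy.max(numeric_only=True)
print(maxes.to_dict())  # {'mag': 5.1, 'gap': 95.0}
```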

In [None]:
# Get the average of one column based on another column 
df.groupby('status')['mag'].mean()

In [None]:
# Get the average of multiple columns based on another column 
df.groupby('status')[['mag','gap']].mean()
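
To see what `groupby` is doing, it can help to compute one group's mean by hand and compare. Here is a sketch on made-up data; the status labels and magnitudes below are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical four-row frame with two status groups
toy = pd.DataFrame({
    'status': ['reviewed', 'automatic', 'reviewed', 'automatic'],
    'mag': [2.0, 1.0, 4.0, 3.0],
})

# One mean per status value
by_status = toy.groupby('status')['mag'].mean()

# The same number for one group, computed with a boolean filter
manual = toy.loc[toy.status == 'reviewed', 'mag'].mean()

print(by_status['reviewed'], manual)  # 3.0 3.0
```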

### Selecting subsets

In [None]:
# select all columns with object datatypes
df.select_dtypes(object)

In [None]:
# select all columns with ints
df.select_dtypes(int)

In [None]:
# select all columns with numeric datatypes
df.select_dtypes('number')
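
`select_dtypes` also takes an `exclude` argument, which is handy for grabbing everything that is not numeric. Here is a minimal sketch on a made-up frame (names and values are hypothetical):

```python
import pandas as pd

# Hypothetical frame with a float, an integer, and a text column
toy = pd.DataFrame({
    'mag': [1.2, 5.1],
    'tsunami': [0, 1],
    'place': ['CA', 'Oregon'],
})

# 'number' covers both integer and float columns
print(list(toy.select_dtypes('number').columns))          # ['mag', 'tsunami']

# exclude= keeps whatever is left over
print(list(toy.select_dtypes(exclude='number').columns))  # ['place']
```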

### Filtering DataFrames with boolean indexing

In [None]:
# keep only the rows where this boolean statement is true (mag greater than or equal to 7)
df[df.mag >= 7]

In [None]:
# important columns for earthquakes that had a magnitude of at least 7 OR caused a tsunami
df.loc[
    (df.tsunami == 1) | (df.mag >= 7),
    ['mag', 'title', 'tsunami', 'place']
].head(3)
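
The filter is just a boolean Series, so it can be built and inspected on its own before being used to select rows. Here is a sketch on a made-up frame (the values are hypothetical). The parentheses around each comparison matter because `|` binds more tightly than `==` and `>=` in Python.

```python
import pandas as pd

# Hypothetical three-row frame
toy = pd.DataFrame({'mag': [7.1, 4.0, 5.5], 'tsunami': [0, 1, 0]})

# One True/False per row; | means element-wise OR
mask = (toy.tsunami == 1) | (toy.mag >= 7)
print(mask.tolist())  # [True, True, False]

# True counts as 1, so summing the mask counts the matching rows
print(mask.sum())     # 2
print(len(toy[mask])) # 2
```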

In [None]:
# Just get the earthquakes in California
df.loc[
    (df.place.str.contains('California')),
    ['mag', 'title', 'tsunami', 'place']
]

In [None]:
# We might have missed some: the USGS has tagged some locations as California and others as CA.
cali_df = df.loc[
    (df.place.str.contains('CA|California')),
    ['mag', 'title', 'tsunami', 'place']
]
cali_df.head(3)
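
The `'CA|California'` pattern is treated as a regular expression, where `|` means "either". Here is a sketch on made-up place names; the `na=False` argument is an addition not used in the cells above, and it only matters if some places are missing.

```python
import pandas as pd

# Hypothetical place names, including one missing value
places = pd.Series([
    '10km N of Somewhere, CA',
    'Central California',
    'Offshore Oregon',
    None,
])

# '|' is regex alternation; na=False treats missing places as non-matches
mask = places.str.contains('CA|California', na=False)
print(mask.tolist())  # [True, True, False, False]
```

Note that the match is a plain substring search, so `'CA'` would also match any place name that happens to contain those two capital letters.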

In [None]:
# if we just want the columns related to magnitude
df.loc[
    (df.place.str.contains('CA|California')),
    [col for col in df.columns if 'mag' in col]
].head(3)

## Activity 
Create a DataFrame with two rows and two columns. The columns should be `mag` and `place`. The first row should contain the information for the smallest earthquake in California (lowest magnitude), and the second row should contain the information for the largest earthquake in California (highest magnitude).

How many earthquakes in the dataset had a red alert?

How many Oregon earthquakes are in the dataset?