[![View slides in browser](https://img.shields.io/badge/view-slides-orange?logo=github)](https://stefmolin.github.io/pandas-workshop/slides/html/workshop.slides.html#/section-1)

---



In [None]:
# What is pandas?
# pandas is a Python library for working with tabular data sets.
# It has functions for analyzing, cleaning, exploring, and manipulating data.
# What can pandas do?
# pandas lets us summarize data sets and compute statistics on them (counts, means, correlations, and so on).
# pandas can clean messy data sets and make them readable and relevant.
# pandas can help us explore our data.
# Official site of pandas: https://pandas.pydata.org/

# Section 1: Getting Started With Pandas

We will begin by introducing the `Series`, `DataFrame`, and `Index` classes, which are the basic building blocks of the pandas library, and showing how to work with them. By the end of this section, you will be able to create DataFrames and perform operations on them to inspect and filter the data.

## Anatomy of a DataFrame

A **DataFrame** is composed of one or more **Series**. The names of the **Series** form the column names, and the row labels form the **Index**.
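
To make this relationship concrete, here is a minimal sketch that builds a small DataFrame out of two Series. The variable names (`mass`, `fall`, `toy`) are made up for illustration, and the values are copied from the first rows of the meteorites data shown below.

```python
import pandas as pd

# Two Series that share the same row labels (the index)
labels = ['Aachen', 'Aarhus', 'Abee']
mass = pd.Series([21.0, 720.0, 107000.0], index=labels, name='mass')
fall = pd.Series(['Fell', 'Fell', 'Fell'], index=labels, name='fall')

# Combining them column-wise gives a DataFrame; the Series names become column names
toy = pd.concat([mass, fall], axis=1)

print(toy.columns.tolist())  # column names come from the Series names
print(toy.index.tolist())    # row labels come from the shared index
print(type(toy['mass']))     # selecting a single column gives back a Series
```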

In [1]:
import pandas as pd

In [2]:
pd.__version__

'1.5.0'

In [3]:
# we start by reading some csv data into a pandas dataframe
# we use .. to go back one directory then we use / to go into the data directory

meteorites = pd.read_csv('../data/Meteorite_Landings.csv', nrows=5)
# nrows=5 means we only want to read the first 5 rows of the csv file
meteorites

Unnamed: 0,name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
0,Aachen,1,Valid,L5,21,Fell,01/01/1880 12:00:00 AM,50.775,6.08333,"(50.775, 6.08333)"
1,Aarhus,2,Valid,H6,720,Fell,01/01/1951 12:00:00 AM,56.18333,10.23333,"(56.18333, 10.23333)"
2,Abee,6,Valid,EH4,107000,Fell,01/01/1952 12:00:00 AM,54.21667,-113.0,"(54.21667, -113.0)"
3,Acapulco,10,Valid,Acapulcoite,1914,Fell,01/01/1976 12:00:00 AM,16.88333,-99.9,"(16.88333, -99.9)"
4,Achiras,370,Valid,L6,780,Fell,01/01/1902 12:00:00 AM,-33.16667,-64.95,"(-33.16667, -64.95)"


In [4]:
# usually we would read the whole file
# very typical to call the dataframe df
df = pd.read_csv('../data/Meteorite_Landings.csv')
df.head() # here we get the first 5 rows of the dataframe by default

Unnamed: 0,name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
0,Aachen,1,Valid,L5,21.0,Fell,01/01/1880 12:00:00 AM,50.775,6.08333,"(50.775, 6.08333)"
1,Aarhus,2,Valid,H6,720.0,Fell,01/01/1951 12:00:00 AM,56.18333,10.23333,"(56.18333, 10.23333)"
2,Abee,6,Valid,EH4,107000.0,Fell,01/01/1952 12:00:00 AM,54.21667,-113.0,"(54.21667, -113.0)"
3,Acapulco,10,Valid,Acapulcoite,1914.0,Fell,01/01/1976 12:00:00 AM,16.88333,-99.9,"(16.88333, -99.9)"
4,Achiras,370,Valid,L6,780.0,Fell,01/01/1902 12:00:00 AM,-33.16667,-64.95,"(-33.16667, -64.95)"


In [5]:
df.tail(7)  # here we get the last 7 rows of the dataframe

Unnamed: 0,name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
45709,Zhongxiang,30406,Valid,Iron,100000.0,Found,01/01/1981 12:00:00 AM,31.2,112.5,"(31.2, 112.5)"
45710,Zillah 001,31355,Valid,L6,1475.0,Found,01/01/1990 12:00:00 AM,29.037,17.0185,"(29.037, 17.0185)"
45711,Zillah 002,31356,Valid,Eucrite,172.0,Found,01/01/1990 12:00:00 AM,29.037,17.0185,"(29.037, 17.0185)"
45712,Zinder,30409,Valid,"Pallasite, ungrouped",46.0,Found,01/01/1999 12:00:00 AM,13.78333,8.96667,"(13.78333, 8.96667)"
45713,Zlin,30410,Valid,H4,3.3,Found,01/01/1939 12:00:00 AM,49.25,17.66667,"(49.25, 17.66667)"
45714,Zubkovsky,31357,Valid,L6,2167.0,Found,01/01/2003 12:00:00 AM,49.78917,41.5046,"(49.78917, 41.5046)"
45715,Zulu Queen,30414,Valid,L3.7,200.0,Found,01/01/1976 12:00:00 AM,33.98333,-115.68333,"(33.98333, -115.68333)"


In [8]:
# we can also get the shape of the dataframe
df.shape

(45716, 10)

In [9]:
# get information about the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45716 entries, 0 to 45715
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   name         45716 non-null  object 
 1   id           45716 non-null  int64  
 2   nametype     45716 non-null  object 
 3   recclass     45716 non-null  object 
 4   mass (g)     45585 non-null  float64
 5   fall         45716 non-null  object 
 6   year         45425 non-null  object 
 7   reclat       38401 non-null  float64
 8   reclong      38401 non-null  float64
 9   GeoLocation  38401 non-null  object 
dtypes: float64(3), int64(1), object(6)
memory usage: 3.5+ MB


*Source: [NASA's Open Data Portal](https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh)*

#### Series:

In [10]:
meteorites.name # alternative way to get a column would be meteorites['name']

0      Aachen
1      Aarhus
2        Abee
3    Acapulco
4     Achiras
Name: name, dtype: object

In [None]:
# dot notation for column access does not work if the column name contains a space
# it also fails if the column name is a Python keyword (for example class, def, or if)
# or if it clashes with an existing DataFrame attribute or method (for example count or shape)
# in those cases you have to use the [] notation
# the [] notation is also needed when the column name is stored in a variable
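
The cases above can be sketched on a tiny DataFrame. The `class` and `count` columns are hypothetical and only exist here to trigger the problem; they are not part of the meteorites data.

```python
import pandas as pd

toy = pd.DataFrame({
    'name': ['Aachen', 'Aarhus'],
    'mass (g)': [21.0, 720.0],   # contains a space and parentheses
    'class': ['L5', 'H6'],       # Python keyword (hypothetical column name)
    'count': [1, 2],             # same name as the DataFrame.count() method (hypothetical)
})

print(toy.name.tolist())          # fine: 'name' is a valid identifier
print(toy['mass (g)'].tolist())   # brackets are required here
print(toy['class'].tolist())      # writing toy.class would be a SyntaxError
print(callable(toy.count))        # dot access finds the method, not the column
print(toy['count'].tolist())      # brackets reach the column

column = 'name'                   # column name stored in a variable
print(toy[column].tolist())
```

Because brackets work in every case, they are the safer default in scripts; dot access is mostly a convenience for interactive exploration.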

In [12]:
# get year series
year = meteorites['year']
year

0    01/01/1880 12:00:00 AM
1    01/01/1951 12:00:00 AM
2    01/01/1952 12:00:00 AM
3    01/01/1976 12:00:00 AM
4    01/01/1902 12:00:00 AM
Name: year, dtype: object

In [13]:
type(year) # this is a pandas series

pandas.core.series.Series

In [None]:
# underneath, a pandas Series is usually backed by a NumPy array - think of it as a very efficient list
# it supports vectorized operations: one expression is applied to every element without a Python loop
# it stores data homogeneously - in one big block of memory - all the same type - unlike Python lists
# for numeric data this typically takes much less space than a Python list holding the same values
# element-wise operations and indexing are fast as a result
# the downside is that the data has to be homogeneous - a single data type per Series
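
A short sketch of what "vectorized" and "homogeneous" mean in practice. The Series names (`mass_g`, `mixed`) are illustrative only.

```python
import numpy as np
import pandas as pd

mass_g = pd.Series([21.0, 720.0, 107000.0], name='mass_g')

# One expression operates on every element; no explicit Python loop is needed
mass_kg = mass_g / 1000
print(mass_kg.tolist())

# For a numeric Series, the values are available as a NumPy array
values = mass_g.to_numpy()
print(type(values), values.dtype)

# Mixing numbers and strings forces the generic 'object' dtype,
# which gives up most of the speed and memory benefits
mixed = pd.Series([21.0, 'unknown', 720.0])
print(mixed.dtype)
```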

![Dataframe](https://pandas.pydata.org/docs/_images/01_table_dataframe.svg)

#### Columns:

In [None]:
# Pandas columns are series
# Pandas dataframes are a collection of series
# with the same index


In [15]:
meteorites.columns

Index(['name', 'id', 'nametype', 'recclass', 'mass (g)', 'fall', 'year',
       'reclat', 'reclong', 'GeoLocation'],
      dtype='object')

#### Index:

In [16]:
# get the index of the dataframe
meteorites.index
# pandas indexes can be used to access rows
# indexes can be integers or strings or dates

RangeIndex(start=0, stop=5, step=1)

## Creating DataFrames

We can create DataFrames from a variety of sources such as other Python objects, flat files, web scraping, and API requests. Here, we will see just a couple of examples, but be sure to check out [this page](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) in the documentation for a complete list.

### Using a flat file

In [5]:
# import pandas as pd

meteorites = pd.read_csv('../data/Meteorite_Landings.csv')

*Tip: There are many parameters to this function to handle some initial processing while reading in the file &ndash; be sure to check out the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).*
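
As a rough illustration of that initial processing, the sketch below uses three `read_csv()` parameters (`usecols`, `nrows`, and `dtype`). To keep it self-contained, it reads from an in-memory string instead of the real file; the string holds three rows and a subset of the columns copied from the output shown earlier.

```python
import io
import pandas as pd

# A tiny in-memory stand-in for the CSV file
csv_text = """name,id,mass (g),fall,year
Aachen,1,21,Fell,01/01/1880 12:00:00 AM
Aarhus,2,720,Fell,01/01/1951 12:00:00 AM
Abee,6,107000,Fell,01/01/1952 12:00:00 AM
"""

subset = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['name', 'mass (g)', 'fall'],  # keep only these columns
    nrows=2,                               # read only the first two data rows
    dtype={'mass (g)': 'float64'},         # set the column type while reading
)
print(subset)
```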

### Using data from an API

Collect the data from [NASA's Open Data Portal](https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh) using the Socrata Open Data API (SODA) with the `requests` library:

In [17]:
import requests # typically we would have the import at the top of the file

response = requests.get(
    'https://data.nasa.gov/resource/gh4g-9sfh.json',
    params={'$limit': 50_000} # example of how to use params in your GET request
)

if response.ok:
    payload = response.json()
    print("Success!")
else:
    print(f'Request was not successful and returned code: {response.status_code}.')
    payload = None

Create the DataFrame with the resulting payload:

In [18]:
# import pandas as pd

df = pd.DataFrame(payload) # I overwrote the old df with the new one
# the old DataFrame is freed automatically once nothing else refers to it
# this means you do not need to delete it manually
df.head(3)

Unnamed: 0,name,id,nametype,recclass,mass,fall,year,reclat,reclong,geolocation,:@computed_region_cbhk_fwbd,:@computed_region_nnqa_25f4
0,Aachen,1,Valid,L5,21,Fell,1880-01-01T00:00:00.000,50.775,6.08333,"{'latitude': '50.775', 'longitude': '6.08333'}",,
1,Aarhus,2,Valid,H6,720,Fell,1951-01-01T00:00:00.000,56.18333,10.23333,"{'latitude': '56.18333', 'longitude': '10.23333'}",,
2,Abee,6,Valid,EH4,107000,Fell,1952-01-01T00:00:00.000,54.21667,-113.0,"{'latitude': '54.21667', 'longitude': '-113.0'}",,


*Tip: `df.to_csv('data.csv')` writes this data to a new file called `data.csv`.*
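
One detail worth knowing when writing files: by default the index is written too, and it comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below shows this round trip with a made-up two-row DataFrame, writing to a string rather than to disk.

```python
import io
import pandas as pd

toy = pd.DataFrame({'name': ['Aachen', 'Aarhus'], 'mass': [21.0, 720.0]})

# Default: the index is written as an extra, unnamed first column
with_index = toy.to_csv()
print(pd.read_csv(io.StringIO(with_index)).columns.tolist())

# index=False leaves the index out of the file
without_index = toy.to_csv(index=False)
print(pd.read_csv(io.StringIO(without_index)).columns.tolist())
```

The same `index=False` argument is accepted by `to_excel()`.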

In [21]:
# to save to Excel
# saving to Excel might require you to install openpyxl - pip install openpyxl
df.to_excel('../data/nasa_meteorites.xlsx')
# Excel is not an efficient format for storing data
# it also cannot hold large data sets: a worksheet is limited to 1,048,576 rows

In [18]:
# let's load the data back from Excel
df = pd.read_excel('../data/nasa_meteorites.xlsx')
# the index was written to the file as well, so it comes back as an extra 'Unnamed: 0' column
df.head()

Unnamed: 0.1,Unnamed: 0,name,id,nametype,recclass,mass,fall,year,reclat,reclong,geolocation,:@computed_region_cbhk_fwbd,:@computed_region_nnqa_25f4
0,0,Aachen,1,Valid,L5,21.0,Fell,1880-01-01T00:00:00.000,50.775,6.08333,"{'latitude': '50.775', 'longitude': '6.08333'}",,
1,1,Aarhus,2,Valid,H6,720.0,Fell,1951-01-01T00:00:00.000,56.18333,10.23333,"{'latitude': '56.18333', 'longitude': '10.23333'}",,
2,2,Abee,6,Valid,EH4,107000.0,Fell,1952-01-01T00:00:00.000,54.21667,-113.0,"{'latitude': '54.21667', 'longitude': '-113.0'}",,
3,3,Acapulco,10,Valid,Acapulcoite,1914.0,Fell,1976-01-01T00:00:00.000,16.88333,-99.9,"{'latitude': '16.88333', 'longitude': '-99.9'}",,
4,4,Achiras,370,Valid,L6,780.0,Fell,1902-01-01T00:00:00.000,-33.16667,-64.95,"{'latitude': '-33.16667', 'longitude': '-64.95'}",,


In [22]:
type(payload)

list

In [23]:
payload[:5]

[{'name': 'Aachen',
  'id': '1',
  'nametype': 'Valid',
  'recclass': 'L5',
  'mass': '21',
  'fall': 'Fell',
  'year': '1880-01-01T00:00:00.000',
  'reclat': '50.775000',
  'reclong': '6.083330',
  'geolocation': {'latitude': '50.775', 'longitude': '6.08333'}},
 {'name': 'Aarhus',
  'id': '2',
  'nametype': 'Valid',
  'recclass': 'H6',
  'mass': '720',
  'fall': 'Fell',
  'year': '1951-01-01T00:00:00.000',
  'reclat': '56.183330',
  'reclong': '10.233330',
  'geolocation': {'latitude': '56.18333', 'longitude': '10.23333'}},
 {'name': 'Abee',
  'id': '6',
  'nametype': 'Valid',
  'recclass': 'EH4',
  'mass': '107000',
  'fall': 'Fell',
  'year': '1952-01-01T00:00:00.000',
  'reclat': '54.216670',
  'reclong': '-113.000000',
  'geolocation': {'latitude': '54.21667', 'longitude': '-113.0'}},
 {'name': 'Acapulco',
  'id': '10',
  'nametype': 'Valid',
  'recclass': 'Acapulcoite',
  'mass': '1914',
  'fall': 'Fell',
  'year': '1976-01-01T00:00:00.000',
  'reclat': '16.883330',
  'reclong': 
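
Since the payload is a list of dictionaries, `pd.DataFrame()` turns each dictionary into a row and each key into a column. Every value in the JSON arrives as a string, so numeric columns need converting. Below is a minimal sketch with two hand-written records shaped like the payload above (only a few of the keys are kept).

```python
import pandas as pd

# Two records shaped like the payload above (every value is a string)
records = [
    {'name': 'Aachen', 'id': '1', 'mass': '21', 'fall': 'Fell'},
    {'name': 'Aarhus', 'id': '2', 'mass': '720', 'fall': 'Fell'},
]

toy = pd.DataFrame(records)   # each dict becomes a row, each key a column
print(toy.dtypes)             # the numeric-looking columns are not numeric yet

toy['mass'] = pd.to_numeric(toy['mass'])   # convert strings to numbers
toy['id'] = toy['id'].astype('int64')
print(toy.dtypes)
```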

## Inspecting the data
Now that we have some data, we need to perform an initial inspection of it. This gives us information on what the data looks like, how many rows/columns there are, and how much data we have. 

Let's inspect the `meteorites` data.

#### How many rows and columns are there?

In [7]:
meteorites.shape

(5, 10)

#### What are the column names?

In [8]:
meteorites.columns

Index(['name', 'id', 'nametype', 'recclass', 'mass (g)', 'fall', 'year',
       'reclat', 'reclong', 'GeoLocation'],
      dtype='object')

In [9]:
# careful: assigning to a column with the . notation does NOT rename it
# this assigns the same value to every row in that column
meteorites.name = 'name_of_meteorite'
meteorites.head()

Unnamed: 0,name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
0,name_of_meteorite,1,Valid,L5,21,Fell,01/01/1880 12:00:00 AM,50.775,6.08333,"(50.775, 6.08333)"
1,name_of_meteorite,2,Valid,H6,720,Fell,01/01/1951 12:00:00 AM,56.18333,10.23333,"(56.18333, 10.23333)"
2,name_of_meteorite,6,Valid,EH4,107000,Fell,01/01/1952 12:00:00 AM,54.21667,-113.0,"(54.21667, -113.0)"
3,name_of_meteorite,10,Valid,Acapulcoite,1914,Fell,01/01/1976 12:00:00 AM,16.88333,-99.9,"(16.88333, -99.9)"
4,name_of_meteorite,370,Valid,L6,780,Fell,01/01/1902 12:00:00 AM,-33.16667,-64.95,"(-33.16667, -64.95)"


In [11]:
# we can rename all columns at once; the list must contain one name per column (check the count with shape)
meteorites.columns = ['name', 'id', 'nametype', 'recclass', 'mass', 'fall', 'year', 'lat', 'long', 'geolocation']
meteorites.head()

Unnamed: 0,name,id,nametype,recclass,mass,fall,year,lat,long,geolocation
0,name_of_meteorite,1,Valid,L5,21,Fell,01/01/1880 12:00:00 AM,50.775,6.08333,"(50.775, 6.08333)"
1,name_of_meteorite,2,Valid,H6,720,Fell,01/01/1951 12:00:00 AM,56.18333,10.23333,"(56.18333, 10.23333)"
2,name_of_meteorite,6,Valid,EH4,107000,Fell,01/01/1952 12:00:00 AM,54.21667,-113.0,"(54.21667, -113.0)"
3,name_of_meteorite,10,Valid,Acapulcoite,1914,Fell,01/01/1976 12:00:00 AM,16.88333,-99.9,"(16.88333, -99.9)"
4,name_of_meteorite,370,Valid,L6,780,Fell,01/01/1902 12:00:00 AM,-33.16667,-64.95,"(-33.16667, -64.95)"


In [13]:
# rename dataframe columns
# rename is a method of the dataframe
# it takes a dictionary as the columns argument
# the keys are the old column names
# the values are the new column names
# inplace=True means that the dataframe will be modified in place
# inplace=False means that the original is left unchanged and a renamed copy is returned
# inplace=False is the default
# you can use spaces in column names but then you have to use the [] notation to access them
meteorites.rename(columns={'name': 'name_of_meteorite', 'mass':'mass_in_g'}, inplace=True)
meteorites.head()

Unnamed: 0,name_of_meteorite,id,nametype,recclass,mass_in_g,fall,year,lat,long,geolocation
0,name_of_meteorite,1,Valid,L5,21,Fell,01/01/1880 12:00:00 AM,50.775,6.08333,"(50.775, 6.08333)"
1,name_of_meteorite,2,Valid,H6,720,Fell,01/01/1951 12:00:00 AM,56.18333,10.23333,"(56.18333, 10.23333)"
2,name_of_meteorite,6,Valid,EH4,107000,Fell,01/01/1952 12:00:00 AM,54.21667,-113.0,"(54.21667, -113.0)"
3,name_of_meteorite,10,Valid,Acapulcoite,1914,Fell,01/01/1976 12:00:00 AM,16.88333,-99.9,"(16.88333, -99.9)"
4,name_of_meteorite,370,Valid,L6,780,Fell,01/01/1902 12:00:00 AM,-33.16667,-64.95,"(-33.16667, -64.95)"
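
The difference between the two `inplace` settings is easy to check on a small example. This is a sketch using a made-up two-row DataFrame, not the workshop data.

```python
import pandas as pd

toy = pd.DataFrame({'name': ['Aachen', 'Aarhus'], 'mass': [21, 720]})

# Default (inplace=False): the original is untouched and a renamed copy is returned
renamed = toy.rename(columns={'mass': 'mass_in_g'})
print(toy.columns.tolist())
print(renamed.columns.tolist())

# inplace=True modifies the original and returns None
result = toy.rename(columns={'name': 'name_of_meteorite'}, inplace=True)
print(result)
print(toy.columns.tolist())

# Keys that do not match any column are silently ignored by default
unchanged = toy.rename(columns={'no_such_column': 'x'})
print(unchanged.columns.tolist())
```

Because `inplace=True` returns `None`, writing `df = df.rename(..., inplace=True)` would replace the DataFrame with `None`; use one style or the other, not both.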


In [None]:
df.columns

#### What type of data does each column currently hold?

In [15]:
# very common to check data types
meteorites.dtypes

name_of_meteorite     object
id                     int64
nametype              object
recclass              object
mass_in_g              int64
fall                  object
year                  object
lat                  float64
long                 float64
geolocation           object
dtype: object

#### What does the data look like?

In [11]:
meteorites.head()

Unnamed: 0,name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
0,Aachen,1,Valid,L5,21.0,Fell,01/01/1880 12:00:00 AM,50.775,6.08333,"(50.775, 6.08333)"
1,Aarhus,2,Valid,H6,720.0,Fell,01/01/1951 12:00:00 AM,56.18333,10.23333,"(56.18333, 10.23333)"
2,Abee,6,Valid,EH4,107000.0,Fell,01/01/1952 12:00:00 AM,54.21667,-113.0,"(54.21667, -113.0)"
3,Acapulco,10,Valid,Acapulcoite,1914.0,Fell,01/01/1976 12:00:00 AM,16.88333,-99.9,"(16.88333, -99.9)"
4,Achiras,370,Valid,L6,780.0,Fell,01/01/1902 12:00:00 AM,-33.16667,-64.95,"(-33.16667, -64.95)"


Sometimes there may be extraneous data at the end of the file, so checking the bottom few rows is also important:

In [16]:
meteorites.tail()

Unnamed: 0,name_of_meteorite,id,nametype,recclass,mass_in_g,fall,year,lat,long,geolocation
0,name_of_meteorite,1,Valid,L5,21,Fell,01/01/1880 12:00:00 AM,50.775,6.08333,"(50.775, 6.08333)"
1,name_of_meteorite,2,Valid,H6,720,Fell,01/01/1951 12:00:00 AM,56.18333,10.23333,"(56.18333, 10.23333)"
2,name_of_meteorite,6,Valid,EH4,107000,Fell,01/01/1952 12:00:00 AM,54.21667,-113.0,"(54.21667, -113.0)"
3,name_of_meteorite,10,Valid,Acapulcoite,1914,Fell,01/01/1976 12:00:00 AM,16.88333,-99.9,"(16.88333, -99.9)"
4,name_of_meteorite,370,Valid,L6,780,Fell,01/01/1902 12:00:00 AM,-33.16667,-64.95,"(-33.16667, -64.95)"


#### Get some information about the DataFrame

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45716 entries, 0 to 45715
Data columns (total 13 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Unnamed: 0                   45716 non-null  int64  
 1   name                         45716 non-null  object 
 2   id                           45716 non-null  int64  
 3   nametype                     45716 non-null  object 
 4   recclass                     45716 non-null  object 
 5   mass                         45585 non-null  float64
 6   fall                         45716 non-null  object 
 7   year                         45425 non-null  object 
 8   reclat                       38401 non-null  float64
 9   reclong                      38401 non-null  float64
 10  geolocation                  38401 non-null  object 
 11  :@computed_region_cbhk_fwbd  1659 non-null   float64
 12  :@computed_region_nnqa_25f4  1659 non-null   float64
dtypes: float64(5), i

### [Exercise 1.1](./workbook.ipynb#Exercise-1.1)

##### Create a DataFrame by reading in the `2019_Yellow_Taxi_Trip_Data.csv` file. Examine the first 5 rows.

In [14]:
# Complete exercise 1.1 in the workbook.ipynb file
# Click on `Exercise 1.1` above to open the workbook.ipynb file

# WARNING: if you complete the exercise here, your cell numbers
# for the rest of the training might not match the slides

### [Exercise 1.2](./workbook.ipynb#Exercise-1.2)

##### Find the dimensions (number of rows and number of columns) in the data.

In [15]:
# Complete exercise 1.2 in the workbook.ipynb file
# Click on `Exercise 1.2` above to open the workbook.ipynb file

# WARNING: if you complete the exercise here, your cell numbers
# for the rest of the training might not match the slides

## Extracting subsets

A crucial part of working with DataFrames is extracting subsets of the data: finding rows that meet a certain set of criteria, isolating columns/rows of interest, etc. After narrowing down our data, we are closer to discovering insights. This section will be the backbone of many analysis tasks.

#### Selecting columns

We can select columns as attributes if their names would be valid Python variables:

In [20]:
meteorites = df # this is just a reference to the same dataframe
# not a copy
# if you modify the dataframe you will modify the original dataframe
# if you want to make a copy you have to use the copy method
# meteorites = df.copy()
# this will make a copy of the dataframe
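
The reference-versus-copy behaviour described in the comments above can be verified directly. The names `original`, `alias`, and `independent` are illustrative.

```python
import pandas as pd

original = pd.DataFrame({'name': ['Aachen', 'Aarhus'], 'mass': [21.0, 720.0]})

alias = original                 # a second name for the same object
independent = original.copy()    # a separate object with its own data

alias['mass_kg'] = alias['mass'] / 1000   # add a column through the alias

print(alias is original)                  # True
print('mass_kg' in original.columns)      # True: the change is visible under both names
print('mass_kg' in independent.columns)   # False: the copy was not affected
```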

In [22]:
m_names = meteorites.name # this selects the column as a series
type(m_names)

pandas.core.series.Series

In [23]:
df.columns

Index(['Unnamed: 0', 'name', 'id', 'nametype', 'recclass', 'mass', 'fall',
       'year', 'reclat', 'reclong', 'geolocation',
       ':@computed_region_cbhk_fwbd', ':@computed_region_nnqa_25f4'],
      dtype='object')

If they aren't, we have to select them as keys. However, we can select multiple columns at once this way:

In [25]:
meteorites[['name', 'mass']].head() # this selects the columns as a dataframe

Unnamed: 0,name,mass
0,Aachen,21.0
1,Aarhus,720.0
2,Abee,107000.0
3,Acapulco,1914.0
4,Achiras,780.0


In [26]:
# I want to get rid of the first column in the dataframe
# one option is the drop method of the dataframe
# it takes a list of columns to drop
# the other option is to select the columns I want to keep - sort of like an allowlist
df.columns

Index(['Unnamed: 0', 'name', 'id', 'nametype', 'recclass', 'mass', 'fall',
       'year', 'reclat', 'reclong', 'geolocation',
       ':@computed_region_cbhk_fwbd', ':@computed_region_nnqa_25f4'],
      dtype='object')
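
Both approaches mentioned in the comments, dropping unwanted columns and selecting the ones to keep, lead to the same result. Here is a minimal sketch on a made-up DataFrame; the `extra` column is hypothetical and stands in for the unwanted columns.

```python
import pandas as pd

toy = pd.DataFrame({
    'Unnamed: 0': [0, 1],
    'name': ['Aachen', 'Aarhus'],
    'mass': [21.0, 720.0],
    'extra': [None, None],   # hypothetical column we do not need
})

# Option 1: name the columns to remove
dropped = toy.drop(columns=['Unnamed: 0', 'extra'])

# Option 2: name the columns to keep
kept = toy[['name', 'mass']]

print(dropped.columns.tolist())
print(dropped.equals(kept))
```

Dropping tends to be more convenient when only a few columns should go, while selecting is clearer when only a few should stay.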

In [27]:
df = df[df.columns[1:]] # i am saying give me all the rows and all the columns except the first one
df.head()

Unnamed: 0,name,id,nametype,recclass,mass,fall,year,reclat,reclong,geolocation,:@computed_region_cbhk_fwbd,:@computed_region_nnqa_25f4
0,Aachen,1,Valid,L5,21.0,Fell,1880-01-01T00:00:00.000,50.775,6.08333,"{'latitude': '50.775', 'longitude': '6.08333'}",,
1,Aarhus,2,Valid,H6,720.0,Fell,1951-01-01T00:00:00.000,56.18333,10.23333,"{'latitude': '56.18333', 'longitude': '10.23333'}",,
2,Abee,6,Valid,EH4,107000.0,Fell,1952-01-01T00:00:00.000,54.21667,-113.0,"{'latitude': '54.21667', 'longitude': '-113.0'}",,
3,Acapulco,10,Valid,Acapulcoite,1914.0,Fell,1976-01-01T00:00:00.000,16.88333,-99.9,"{'latitude': '16.88333', 'longitude': '-99.9'}",,
4,Achiras,370,Valid,L6,780.0,Fell,1902-01-01T00:00:00.000,-33.16667,-64.95,"{'latitude': '-33.16667', 'longitude': '-64.95'}",,


In [28]:
# I do not want the last two columns either
# again I will use the select columns I want to keep method
df = df[df.columns[:-2]] # so I give a list of columns to select except the last two
df.head()

Unnamed: 0,name,id,nametype,recclass,mass,fall,year,reclat,reclong,geolocation
0,Aachen,1,Valid,L5,21.0,Fell,1880-01-01T00:00:00.000,50.775,6.08333,"{'latitude': '50.775', 'longitude': '6.08333'}"
1,Aarhus,2,Valid,H6,720.0,Fell,1951-01-01T00:00:00.000,56.18333,10.23333,"{'latitude': '56.18333', 'longitude': '10.23333'}"
2,Abee,6,Valid,EH4,107000.0,Fell,1952-01-01T00:00:00.000,54.21667,-113.0,"{'latitude': '54.21667', 'longitude': '-113.0'}"
3,Acapulco,10,Valid,Acapulcoite,1914.0,Fell,1976-01-01T00:00:00.000,16.88333,-99.9,"{'latitude': '16.88333', 'longitude': '-99.9'}"
4,Achiras,370,Valid,L6,780.0,Fell,1902-01-01T00:00:00.000,-33.16667,-64.95,"{'latitude': '-33.16667', 'longitude': '-64.95'}"


In [29]:
# save dataframe to excel without index
df.to_excel("../data/nasa_meteorites.xlsx", index=False) # overwriting the old file

In [30]:
# save dataframe to csv without index
df.to_csv("../data/nasa_meteorites.csv", index=False) # overwriting the old file

In [2]:
df = pd.read_csv("../data/nasa_meteorites.csv")
df.head()

Unnamed: 0,name,id,nametype,recclass,mass,fall,year,reclat,reclong,geolocation
0,Aachen,1,Valid,L5,21.0,Fell,1880-01-01T00:00:00.000,50.775,6.08333,"{'latitude': '50.775', 'longitude': '6.08333'}"
1,Aarhus,2,Valid,H6,720.0,Fell,1951-01-01T00:00:00.000,56.18333,10.23333,"{'latitude': '56.18333', 'longitude': '10.23333'}"
2,Abee,6,Valid,EH4,107000.0,Fell,1952-01-01T00:00:00.000,54.21667,-113.0,"{'latitude': '54.21667', 'longitude': '-113.0'}"
3,Acapulco,10,Valid,Acapulcoite,1914.0,Fell,1976-01-01T00:00:00.000,16.88333,-99.9,"{'latitude': '16.88333', 'longitude': '-99.9'}"
4,Achiras,370,Valid,L6,780.0,Fell,1902-01-01T00:00:00.000,-33.16667,-64.95,"{'latitude': '-33.16667', 'longitude': '-64.95'}"


In [3]:
meteorites=df

#### Selecting rows

In [4]:
meteorites[100:104]

Unnamed: 0,name,id,nametype,recclass,mass,fall,year,reclat,reclong,geolocation
100,Benton,5026,Valid,LL6,2840.0,Fell,1949-01-01T00:00:00.000,45.95,-67.55,"{'latitude': '45.95', 'longitude': '-67.55'}"
101,Berduc,48975,Valid,L6,270.0,Fell,2008-01-01T00:00:00.000,-31.91,-58.32833,"{'latitude': '-31.91', 'longitude': '-58.32833'}"
102,Béréba,5028,Valid,Eucrite-mmict,18000.0,Fell,1924-01-01T00:00:00.000,11.65,-3.65,"{'latitude': '11.65', 'longitude': '-3.65'}"
103,Berlanguillas,5029,Valid,L6,1440.0,Fell,1811-01-01T00:00:00.000,41.68333,-3.8,"{'latitude': '41.68333', 'longitude': '-3.8'}"


In [None]:
# pandas has two types of indexing for rows
# integer (position) based indexing with iloc[]
# label based indexing with loc[]
# slicing with plain [] and integers, as above, goes by position
# label based indexing is usually the one we want to use
# in this case our index labels happen to be numbers

#### Indexing

We use `iloc[]` to select rows and columns by their position:

In [5]:
meteorites.iloc[2:7] # this is integer based indexing
# notice the difference between iloc and loc
# here index 7 is not included
# it will work no matter what the index is

Unnamed: 0,name,id,nametype,recclass,mass,fall,year,reclat,reclong,geolocation
2,Abee,6,Valid,EH4,107000.0,Fell,1952-01-01T00:00:00.000,54.21667,-113.0,"{'latitude': '54.21667', 'longitude': '-113.0'}"
3,Acapulco,10,Valid,Acapulcoite,1914.0,Fell,1976-01-01T00:00:00.000,16.88333,-99.9,"{'latitude': '16.88333', 'longitude': '-99.9'}"
4,Achiras,370,Valid,L6,780.0,Fell,1902-01-01T00:00:00.000,-33.16667,-64.95,"{'latitude': '-33.16667', 'longitude': '-64.95'}"
5,Adhi Kot,379,Valid,EH4,4239.0,Fell,1919-01-01T00:00:00.000,32.1,71.8,"{'latitude': '32.1', 'longitude': '71.8'}"
6,Adzhi-Bogdo (stone),390,Valid,LL3-6,910.0,Fell,1949-01-01T00:00:00.000,44.83333,95.16667,"{'latitude': '44.83333', 'longitude': '95.16667'}"


In [36]:
meteorites.loc[2:7] # this is label based indexing
# labels just happen to be numbers in this case
# here index 7 is included
# the difference is easier to see with non-integer labels or with integer labels that are not sequential

Unnamed: 0,name,id,nametype,recclass,mass,fall,year,reclat,reclong,geolocation
2,Abee,6,Valid,EH4,107000.0,Fell,1952-01-01T00:00:00.000,54.21667,-113.0,"{'latitude': '54.21667', 'longitude': '-113.0'}"
3,Acapulco,10,Valid,Acapulcoite,1914.0,Fell,1976-01-01T00:00:00.000,16.88333,-99.9,"{'latitude': '16.88333', 'longitude': '-99.9'}"
4,Achiras,370,Valid,L6,780.0,Fell,1902-01-01T00:00:00.000,-33.16667,-64.95,"{'latitude': '-33.16667', 'longitude': '-64.95'}"
5,Adhi Kot,379,Valid,EH4,4239.0,Fell,1919-01-01T00:00:00.000,32.1,71.8,"{'latitude': '32.1', 'longitude': '71.8'}"
6,Adzhi-Bogdo (stone),390,Valid,LL3-6,910.0,Fell,1949-01-01T00:00:00.000,44.83333,95.16667,"{'latitude': '44.83333', 'longitude': '95.16667'}"
7,Agen,392,Valid,H5,30000.0,Fell,1814-01-01T00:00:00.000,44.21667,0.61667,"{'latitude': '44.21667', 'longitude': '0.61667'}"
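
With the default index, labels and positions coincide, which hides most of the difference between `iloc[]` and `loc[]`. The sketch below uses integer labels that deliberately do not match the positions; the labels 10 through 50 are made up for illustration.

```python
import pandas as pd

# Integer labels that deliberately do not match the row positions
toy = pd.DataFrame(
    {'name': ['Aachen', 'Aarhus', 'Abee', 'Acapulco', 'Achiras']},
    index=[10, 20, 30, 40, 50],
)

by_position = toy.iloc[1:3]   # positions 1 and 2; the end position is excluded
by_label = toy.loc[20:40]     # labels 20 through 40; the end label is included

print(by_position['name'].tolist())
print(by_label['name'].tolist())
```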


In [37]:
# we can select specific rows and columns by iloc indexing
meteorites.iloc[100:104, [0, 3, 4, 6]]

Unnamed: 0,name,recclass,mass,year
100,Benton,LL6,2840.0,1949-01-01T00:00:00.000
101,Berduc,L6,270.0,2008-01-01T00:00:00.000
102,Béréba,Eucrite-mmict,18000.0,1924-01-01T00:00:00.000
103,Berlanguillas,L6,1440.0,1811-01-01T00:00:00.000


In [6]:
# let's check whether names are unique
# the nunique method of the series returns the number of distinct values
# the len function gives the total number of values in the series
# if the two numbers are the same
# then all the values are unique
df.name.nunique(), len(df.name), df.name.nunique() == len(df.name)

(45716, 45716, True)
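
pandas also offers a shortcut for this check, `Series.is_unique`, and `duplicated()` helps locate repeats when the check fails. A small sketch with a made-up Series in which one name is repeated on purpose:

```python
import pandas as pd

names = pd.Series(['Aachen', 'Aarhus', 'Abee', 'Aachen'])   # 'Aachen' appears twice

print(names.nunique(), len(names))   # 3 distinct values out of 4
print(names.is_unique)               # False

# duplicated() flags the second and later occurrences of each value
print(names[names.duplicated()].tolist())
```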

In [7]:
# that is good news: we can use name as the index
# here I assign the name column to the index directly
# the set_index method of the dataframe (next cell) is the more usual way
df.index = df.name
df.head() # with this approach I still have the name column, which I probably want to drop like we did before

Unnamed: 0_level_0,name,id,nametype,recclass,mass,fall,year,reclat,reclong,geolocation
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Aachen,Aachen,1,Valid,L5,21.0,Fell,1880-01-01T00:00:00.000,50.775,6.08333,"{'latitude': '50.775', 'longitude': '6.08333'}"
Aarhus,Aarhus,2,Valid,H6,720.0,Fell,1951-01-01T00:00:00.000,56.18333,10.23333,"{'latitude': '56.18333', 'longitude': '10.23333'}"
Abee,Abee,6,Valid,EH4,107000.0,Fell,1952-01-01T00:00:00.000,54.21667,-113.0,"{'latitude': '54.21667', 'longitude': '-113.0'}"
Acapulco,Acapulco,10,Valid,Acapulcoite,1914.0,Fell,1976-01-01T00:00:00.000,16.88333,-99.9,"{'latitude': '16.88333', 'longitude': '-99.9'}"
Achiras,Achiras,370,Valid,L6,780.0,Fell,1902-01-01T00:00:00.000,-33.16667,-64.95,"{'latitude': '-33.16667', 'longitude': '-64.95'}"


In [8]:
df.set_index('name', inplace=True) # this should drop the column and use it as the index column
# if we did not use inplace=True we would have to assign the result to a new dataframe
# df = df.set_index('name') # this would work, this is the out of place approach
df.head()

Unnamed: 0_level_0,id,nametype,recclass,mass,fall,year,reclat,reclong,geolocation
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Aachen,1,Valid,L5,21.0,Fell,1880-01-01T00:00:00.000,50.775,6.08333,"{'latitude': '50.775', 'longitude': '6.08333'}"
Aarhus,2,Valid,H6,720.0,Fell,1951-01-01T00:00:00.000,56.18333,10.23333,"{'latitude': '56.18333', 'longitude': '10.23333'}"
Abee,6,Valid,EH4,107000.0,Fell,1952-01-01T00:00:00.000,54.21667,-113.0,"{'latitude': '54.21667', 'longitude': '-113.0'}"
Acapulco,10,Valid,Acapulcoite,1914.0,Fell,1976-01-01T00:00:00.000,16.88333,-99.9,"{'latitude': '16.88333', 'longitude': '-99.9'}"
Achiras,370,Valid,L6,780.0,Fell,1902-01-01T00:00:00.000,-33.16667,-64.95,"{'latitude': '-33.16667', 'longitude': '-64.95'}"
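
A compact sketch of the non-inplace form on a made-up DataFrame, together with `reset_index()`, which undoes the operation by moving the index back into a regular column.

```python
import pandas as pd

toy = pd.DataFrame({'name': ['Aachen', 'Aarhus'], 'mass': [21.0, 720.0]})

indexed = toy.set_index('name')   # 'name' moves from the columns into the index
print(indexed.columns.tolist())
print(indexed.loc['Aarhus', 'mass'])   # rows can now be looked up by name

restored = indexed.reset_index()  # moves the index back into a regular column
print(restored.columns.tolist())
```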


We use `loc[]` to select by name:

In [11]:
# first, a look at a few rows by position to see which labels are available
meteorites.iloc[22:27]

Unnamed: 0_level_0,id,nametype,recclass,mass,fall,year,reclat,reclong,geolocation
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alby sur Chéran,458,Valid,Eucrite-mmict,252.0,Fell,2002-01-01T00:00:00.000,45.82133,6.01533,"{'latitude': '45.82133', 'longitude': '6.01533'}"
Aldsworth,461,Valid,LL5,700.0,Fell,1835-01-01T00:00:00.000,51.78333,-1.78333,"{'latitude': '51.78333', 'longitude': '-1.78333'}"
Aleppo,462,Valid,L6,3200.0,Fell,1873-01-01T00:00:00.000,36.23333,37.13333,"{'latitude': '36.23333', 'longitude': '37.13333'}"
Alessandria,463,Valid,H5,908.0,Fell,1860-01-01T00:00:00.000,44.88333,8.75,"{'latitude': '44.88333', 'longitude': '8.75'}"
Alexandrovsky,465,Valid,H4,9251.0,Fell,1900-01-01T00:00:00.000,50.95,31.81667,"{'latitude': '50.95', 'longitude': '31.81667'}"


In [12]:
meteorites.loc['Acapulco':'Alessandria', 'mass':'year']

Unnamed: 0_level_0,mass,fall,year
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Acapulco,1914.0,Fell,1976-01-01T00:00:00.000
Achiras,780.0,Fell,1902-01-01T00:00:00.000
Adhi Kot,4239.0,Fell,1919-01-01T00:00:00.000
Adzhi-Bogdo (stone),910.0,Fell,1949-01-01T00:00:00.000
Agen,30000.0,Fell,1814-01-01T00:00:00.000
Aguada,1620.0,Fell,1930-01-01T00:00:00.000
Aguila Blanca,1440.0,Fell,1920-01-01T00:00:00.000
Aioun el Atrouss,1000.0,Fell,1974-01-01T00:00:00.000
Aïr,24000.0,Fell,1925-01-01T00:00:00.000
Aire-sur-la-Lys,,Fell,1769-01-01T00:00:00.000


#### Filtering with Boolean masks

A **Boolean mask** is an array-like structure of Boolean values &ndash; it's a way to specify which rows/columns we want to select (`True`) and which we don't (`False`).

Here's an example of a Boolean mask for meteorites weighing more than 50 grams that were found on Earth (i.e., they were not observed falling):

In [13]:
(meteorites.mass > 50) & (meteorites.fall == 'Found') # will return a series of booleans
# notice the & this is bitwise and
# there is also | for bitwise or
# there is also ~ for bitwise not - negation

name
Aachen        False
Aarhus        False
Abee          False
Acapulco      False
Achiras       False
              ...  
Zillah 002     True
Zinder        False
Zlin          False
Zubkovsky      True
Zulu Queen     True
Length: 45716, dtype: bool

**Important**: Take note of the syntax here. We surround each condition with parentheses, and we use bitwise operators (`&`, `|`, `~`) instead of logical operators (`and`, `or`, `not`).
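
To see why this matters, the sketch below builds a mask on a made-up three-row DataFrame and then tries the same condition with the `and` keyword, which raises a `ValueError` because pandas will not reduce a whole Series to a single `True`/`False`.

```python
import pandas as pd

toy = pd.DataFrame({
    'mass': [21.0, 720.0, 100000.0],
    'fall': ['Fell', 'Fell', 'Found'],
})

# Element-wise AND of two Boolean Series; the parentheses are needed because
# & binds more tightly than comparison operators such as > and ==
mask = (toy.mass > 50) & (toy.fall == 'Found')
print(mask.tolist())

# The keyword `and` asks for one True/False value per Series,
# which pandas refuses to guess
try:
    (toy.mass > 50) and (toy.fall == 'Found')
except ValueError as error:
    print(type(error).__name__)
```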

We can use a Boolean mask to select the subset of meteorites weighing more than 1 million grams (1,000 kilograms or roughly 2,205 pounds) that were observed falling:

In [16]:
meteorites[(meteorites['mass'] > 1e6) & (meteorites.fall == 'Fell')]

Unnamed: 0_level_0,id,nametype,recclass,mass,fall,year,reclat,reclong,geolocation
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Allende,2278,Valid,CV3,2000000.0,Fell,1969-01-01T00:00:00.000,26.96667,-105.31667,"{'latitude': '26.96667', 'longitude': '-105.31..."
Jilin,12171,Valid,H5,4000000.0,Fell,1976-01-01T00:00:00.000,44.05,126.16667,"{'latitude': '44.05', 'longitude': '126.16667'}"
Kunya-Urgench,12379,Valid,H5,1100000.0,Fell,1998-01-01T00:00:00.000,42.25,59.2,"{'latitude': '42.25', 'longitude': '59.2'}"
Norton County,17922,Valid,Aubrite,1100000.0,Fell,1948-01-01T00:00:00.000,39.68333,-99.86667,"{'latitude': '39.68333', 'longitude': '-99.866..."
Sikhote-Alin,23593,Valid,"Iron, IIAB",23000000.0,Fell,1947-01-01T00:00:00.000,46.16,134.65333,"{'latitude': '46.16', 'longitude': '134.65333'}"


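*Tip: Boolean masks can be used with `loc[]` and `iloc[]`. Note that `iloc[]` expects a plain Boolean array (e.g., `mask.to_numpy()`) rather than a Boolean `Series`.*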
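As a minimal sketch of masks with `loc[]` and `iloc[]`, the example below uses a small made-up DataFrame (the `toy` data is an illustrative assumption). With `loc[]` we can filter rows and pick columns in one step; with `iloc[]`, which is position-based, we pass the underlying Boolean array:

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame(
    {'mass': [21.0, 720.0, 107000.0], 'fall': ['Fell', 'Fell', 'Found']},
    index=['A', 'B', 'C'],
)
mask = toy['mass'] > 100

# loc[] accepts a Boolean Series and lets us select columns at the same time
heavy_fall = toy.loc[mask, 'fall']

# iloc[] works with positions, so we hand it the Boolean NumPy array
heavy_rows = toy.iloc[mask.to_numpy()]
```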

An alternative to filtering with a Boolean mask is the `query()` method:


In [None]:
# link to query documentation
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html

In [19]:
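# backticks are only required for column names that aren't valid Python identifiers
# (e.g., names containing spaces); they are optional for `mass`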
meteorites.query("`mass` > 1e6 and fall == 'Fell'")

name,id,nametype,recclass,mass,fall,year,reclat,reclong,geolocation
Allende,2278,Valid,CV3,2000000.0,Fell,1969-01-01T00:00:00.000,26.96667,-105.31667,"{'latitude': '26.96667', 'longitude': '-105.31..."
Jilin,12171,Valid,H5,4000000.0,Fell,1976-01-01T00:00:00.000,44.05,126.16667,"{'latitude': '44.05', 'longitude': '126.16667'}"
Kunya-Urgench,12379,Valid,H5,1100000.0,Fell,1998-01-01T00:00:00.000,42.25,59.2,"{'latitude': '42.25', 'longitude': '59.2'}"
Norton County,17922,Valid,Aubrite,1100000.0,Fell,1948-01-01T00:00:00.000,39.68333,-99.86667,"{'latitude': '39.68333', 'longitude': '-99.866..."
Sikhote-Alin,23593,Valid,"Iron, IIAB",23000000.0,Fell,1947-01-01T00:00:00.000,46.16,134.65333,"{'latitude': '46.16', 'longitude': '134.65333'}"


*Tip: Here, we can use both logical operators and bitwise operators.*
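The sketch below illustrates this on a small made-up DataFrame (the `toy` data and the `min_mass` variable are illustrative assumptions). Both operator styles give the same result inside `query()`, and a Python variable can be referenced by prefixing it with `@`:

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame(
    {'mass': [21.0, 2e6, 4e6], 'fall': ['Fell', 'Fell', 'Found']},
    index=['A', 'B', 'C'],
)

# Logical operators
with_and = toy.query("mass > 1e6 and fall == 'Fell'")

# Bitwise operators (parentheses kept for readability)
with_amp = toy.query("(mass > 1e6) & (fall == 'Fell')")

# Referencing a Python variable with the @ prefix
min_mass = 1e6
with_var = toy.query("mass > @min_mass and fall == 'Fell'")
```

All three queries select only row `B`.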

## Calculating summary statistics

In the next section of this workshop, we will discuss data cleaning for a more meaningful analysis of our datasets; however, we can already extract some interesting insights from the `meteorites` data by calculating summary statistics.

#### How many of the meteorites were found versus observed falling?

In [20]:
meteorites.fall.unique()

array(['Fell', 'Found'], dtype=object)

In [21]:
meteorites.fall.value_counts()

Found    44609
Fell      1107
Name: fall, dtype: int64

*Tip: Pass in `normalize=True` to see this result as proportions of the total (multiply by 100 for percentages). Check the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) for additional functionality.*
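A minimal sketch of `normalize=True` on a made-up `Series` (the four values are an illustrative assumption, not the real counts):

```python
import pandas as pd

# Hypothetical toy data, for illustration only
fall = pd.Series(['Found', 'Found', 'Found', 'Fell'], name='fall')

# Fractions of the total instead of raw counts
shares = fall.value_counts(normalize=True)

# Multiply by 100 to express the shares as percentages
percentages = shares * 100
```

Here, `shares` holds 0.75 for `'Found'` and 0.25 for `'Fell'`.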

#### What was the mass of the average meteorite?

In [22]:
meteorites['mass'].median()

32.6
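The cell above uses the median as the measure of the "average". A minimal sketch, on made-up numbers (an illustrative assumption), of why the median can be a more representative choice than the mean when a few very large values are present:

```python
import pandas as pd

# Hypothetical masses with one very large value, for illustration only
masses = pd.Series([1.0, 2.0, 3.0, 4.0, 1000.0])

mean_mass = masses.mean()      # pulled upward by the single large value
median_mass = masses.median()  # the middle value, unaffected by the outlier
```

In this toy example, the mean is 202.0 while the median is 3.0.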

We can take this a step further and look at quantiles:

In [24]:
meteorites['mass'].quantile([0.01, 0.05, 0.5, 0.95, 0.99])

0.01        0.44
0.05        1.10
0.50       32.60
0.95     4000.00
0.99    50600.00
Name: mass, dtype: float64

In [26]:
import numpy as np
meteorites['mass'].quantile(list(np.arange(0.01, 1, 0.01))) # very detailed quantiles

0.01        0.4400
0.02        0.6000
0.03        0.7500
0.04        0.9036
0.05        1.1000
           ...    
0.95     4000.0000
0.96     6000.0000
0.97    10000.0000
0.98    19072.8000
0.99    50600.0000
Name: mass, Length: 99, dtype: float64

#### What was the mass of the heaviest meteorite?

In [27]:
meteorites['mass'].max()

60000000.0

Let's extract the information on this meteorite:

In [28]:
meteorites.loc[meteorites['mass'].idxmax()]

id                                                         11890
nametype                                                   Valid
recclass                                               Iron, IVB
mass                                                  60000000.0
fall                                                       Found
year                                     1920-01-01T00:00:00.000
reclat                                                 -19.58333
reclong                                                 17.91667
geolocation    {'latitude': '-19.58333', 'longitude': '17.916...
Name: Hoba, dtype: object
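The lookup above works in two steps: `idxmax()` returns the index label of the largest value, and `loc[]` then selects the row with that label. A minimal sketch on a three-row DataFrame (the `toy` subset is built by hand from values shown earlier in this section):

```python
import pandas as pd

# Small hand-built subset, for illustration only
toy = pd.DataFrame(
    {'mass': [21.0, 60000000.0, 720.0], 'fall': ['Fell', 'Found', 'Fell']},
    index=pd.Index(['Aachen', 'Hoba', 'Aarhus'], name='name'),
)

# Step 1: the index label of the row with the largest mass
heaviest_label = toy['mass'].idxmax()

# Step 2: select that row by label
heaviest = toy.loc[heaviest_label]
```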

#### How many different types of meteorite classes are represented in this dataset?

In [29]:
meteorites.recclass.nunique()

466

Some examples:

In [30]:
meteorites.recclass.unique()[:14]

array(['L5', 'H6', 'EH4', 'Acapulcoite', 'L6', 'LL3-6', 'H5', 'L',
       'Diogenite-pm', 'Unknown', 'H4', 'H', 'Iron, IVA', 'CR2-an'],
      dtype=object)

*Note: All fields preceded with "rec" are the values recommended by The Meteoritical Society. Check out [this Wikipedia article](https://en.wikipedia.org/wiki/Meteorite_classification) for some information on meteorite classes.*

#### Get some summary statistics on the data itself
We can get common summary statistics for all columns at once. By default, this will only be numeric columns, but here, we will summarize everything together:

In [30]:
meteorites.describe(include='all') # all columns numeric and non-numeric
# default is only numeric columns

Unnamed: 0,id,nametype,recclass,mass,fall,year,reclat,reclong,geolocation
count,45716.0,45716,45716,45585.0,45716,45425,38401.0,38401.0,38401
unique,,2,466,,2,266,,,17100
top,,Valid,L6,,Found,2003-01-01T00:00:00.000,,,"{'latitude': '0.0', 'longitude': '0.0'}"
freq,,45641,8285,,44609,3323,,,6214
mean,26889.735104,,,13278.08,,,-39.12258,61.074319,
std,16860.68303,,,574988.9,,,46.378511,80.647298,
min,1.0,,,0.0,,,-87.36667,-165.43333,
25%,12688.75,,,7.2,,,-76.71424,0.0,
50%,24261.5,,,32.6,,,-71.5,35.66667,
75%,40656.75,,,202.6,,,0.0,157.16667,


**Important**: `NaN` values signify missing data. For instance, the `fall` column contains strings, so there is no value for `mean`; likewise, `mass` is numeric, so we don't have entries for the categorical summary statistics (`unique`, `top`, `freq`).
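A minimal sketch of where these `NaN` values come from, using a small made-up DataFrame (the `toy` data is an illustrative assumption) with one numeric and one string column:

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame(
    {'mass': [21.0, 720.0, 107000.0], 'fall': ['Fell', 'Fell', 'Found']}
)

# Numeric and non-numeric columns summarized together
summary = toy.describe(include='all')

# No mean for a string column, no unique/top/freq for a numeric column
mean_of_fall_missing = pd.isna(summary.loc['mean', 'fall'])
unique_of_mass_missing = pd.isna(summary.loc['unique', 'mass'])

# The default call summarizes only the numeric columns
numeric_only = toy.describe()
```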

#### Check out the documentation for more descriptive statistics:

- [Series](https://pandas.pydata.org/docs/reference/series.html#computations-descriptive-stats)
- [DataFrame](https://pandas.pydata.org/docs/reference/frame.html#computations-descriptive-stats)

### [Exercise 1.3](./workbook.ipynb#Exercise-1.3)

##### Using the data in the `2019_Yellow_Taxi_Trip_Data.csv` file, calculate summary statistics for the `fare_amount`, `tip_amount`, `tolls_amount`, and `total_amount` columns.

In [32]:
# Complete exercise 1.3 in the workbook.ipynb file
# Click on `Exercise 1.3` above to open the workbook.ipynb file

### [Exercise 1.4](./workbook.ipynb#Exercise-1.4)

##### Find the dimensions (number of rows and number of columns) in the data.

In [33]:
# Complete exercise 1.4 in the workbook.ipynb file
# Click on `Exercise 1.4` above to open the workbook.ipynb file

## Up Next: [Data Wrangling](./2-data_wrangling.ipynb)