## Lesson 1: Pandas Beginnings

#### Lesson Overview
First, we will introduce the `DataFrame`, the core data structure of the *Pandas* library, and walk through the basics of working with one.

#### Lesson Goals
By the end of this lesson you should be able to:
1. Load CSV data into a DataFrame
2. Perform DataFrame operations to inspect, filter, and calculate statistics about the data

In [None]:
# Load necessary packages for this lesson
import os

import pandas as pd

#### DataFrame Structure
A **DataFrame** is a table of data with named columns, where each column may hold a different type of data (text, integers, decimals, etc.). Each row in a DataFrame corresponds to one data sample, which consists of an entry for every column. Consider the example in the following cell:

* This DataFrame has column names: 'name', 'id', 'nametype', 'recclass', 'mass (g)' etc.

* The first row is a meteorite named 'Aachen' with id 1, a Valid nametype, L5 recclass, 21 gram mass, etc. We consider this row as a single data sample since it contains all information for one meteorite. Row two then contains all info for another meteorite.

In [None]:
filepath = os.path.realpath(os.path.join(os.getcwd(), '..', 'data', 'Meteorite_Landings.csv')) # Path to the data location
print(filepath) # display path to file for example purposes only
meteorites = pd.read_csv(filepath, nrows=5) # Load 5 rows of the data

meteorites # display the data
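
To make the row/column structure concrete without needing the CSV file, here is a minimal sketch that builds a tiny DataFrame by hand. The variable name `toy` and the second row are made up for illustration; only the 'Aachen' values come from the example above.

```python
import pandas as pd

# Toy data: the first row mirrors the 'Aachen' example above; the second row is made up
toy = pd.DataFrame({
    'name': ['Aachen', 'Example'],
    'id': [1, 2],
    'mass (g)': [21.0, 720.0],
})

print(toy.shape)    # (number of rows, number of columns)
print(toy.dtypes)   # each column has its own data type
print(toy.iloc[0])  # one row = one data sample
```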

#### Load data from a CSV

In the previous cell, we showed an example of loading a dataset about Meteorite Landings (*Source: [NASA's Open Data Portal](https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh)*) from a CSV file.

* We use the function *pd.read_csv*, which requires one input: a string (or os.PathLike object) specifying the file location

<center> pd.read_csv(filepath) </center>

(*Note: There are many optional inputs to this function that handle some initial processing while reading in the file &ndash; check out the [documentation for all inputs](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). Much of the time these optional inputs are unnecessary, and when they are needed I find that a Google search or ChatGPT question about how I want to load the data will tell me what to provide for them.)*
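
As a small illustration of two of these optional inputs, the sketch below reads made-up CSV text from memory (via `io.StringIO`) instead of a file on disk. The data and the variable names `csv_text` and `subset` are assumptions for this example only.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk
csv_text = "\n".join([
    "name,id,mass (g),year",
    "A,1,21.0,1880",
    "B,2,720.0,1951",
    "C,3,107000.0,1952",
])

# usecols keeps only the listed columns; nrows limits how many rows are read
subset = pd.read_csv(io.StringIO(csv_text), usecols=['name', 'mass (g)'], nrows=2)
print(subset)
```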

##### Let's break down the previous cell in greater detail

1. Specify the path to the dataset
<center> filepath = os.path.realpath(os.path.join(os.getcwd(), '..', 'data', 'Meteorite_Landings.csv')) </center>

* *os.getcwd()* returns the path to your current working directory
* *os.path.join(os.getcwd(), '..', 'data', 'Meteorite_Landings.csv')* joins all of its inputs (*os.getcwd()*, '..', 'data', and 'Meteorite_Landings.csv') into a single path. (Essentially, it places the operating system's path separator between the inputs: a backslash \ on Windows, a forward slash / on macOS and Linux.)
	* Remark: '..' refers to the parent folder, so it effectively removes the previous folder from the path, which we can see using *os.path.realpath*
* *os.path.realpath(os.path.join(os.getcwd(), '..', 'data', 'Meteorite_Landings.csv'))* creates the "real path" by applying the operation that '..' performs (and resolving any symbolic links)

You can see these *os* commands in action by running the next cell. Note that a string specifying the path could be used instead of these *os* commands. For example, open File Explorer and navigate to the folder containing the file you want to load. Right-click the file and select "Properties", and a window will pop up. In this window, the "Location" line gives the path to the folder containing your file; add the file name to the end to get the full file path. You can copy this, paste it into a string, and use that as your filepath. On Windows, write it as a raw string (prefix the opening quote with r) so that the backslashes are not treated as escape characters.

2. Load in the dataset
<center> meteorites = pd.read_csv(filepath, nrows=5) </center>

* This loads in the first 5 rows of the CSV file 'Meteorite_Landings.csv'

In [None]:
print(os.getcwd()) # should look like '<path to ai-workshop folder>\ai-workshop\pandas-notebooks'
print(os.path.join(os.getcwd(), '..', 'data', 'Meteorite_Landings.csv'))	# should look like '<path to ai-workshop folder>\ai-workshop\pandas-notebooks\..\data\Meteorite_Landings.csv'
print(os.path.realpath(os.path.join(os.getcwd(), '..', 'data', 'Meteorite_Landings.csv')))	# should look like '<path to ai-workshop folder>\ai-workshop\data\Meteorite_Landings.csv'
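
The same path can also be built with the standard library's `pathlib` module. This is an alternative that this lesson does not otherwise use, sketched here for comparison; the names `os_path` and `pathlib_path` are chosen for this example.

```python
import os
from pathlib import Path

# Same path as above, built with os.path
os_path = os.path.realpath(os.path.join(os.getcwd(), '..', 'data', 'Meteorite_Landings.csv'))

# pathlib alternative: '/' joins path parts and resolve() applies the '..'
pathlib_path = (Path.cwd() / '..' / 'data' / 'Meteorite_Landings.csv').resolve()

print(os_path)
print(pathlib_path)
```

A `Path` object can be passed to *pd.read_csv* directly, since it is an os.PathLike object.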

#### Inspecting the data
Now that we have some data, let's perform an initial inspection of it. This tells us what the data looks like, how many rows/columns there are, what type of data we have, etc.

First we will load the entire dataset by dropping the *nrows=5* optional input for *pd.read_csv*

In [None]:
meteorites = pd.read_csv(filepath) # load in the full dataset

How many rows and columns are there?

In [None]:
meteorites.shape

What are the column names?

In [None]:
meteorites.columns

What type of data does each column currently hold?

In [None]:
meteorites.dtypes

What does the data look like?

In [None]:
meteorites.head() # display first 5 rows of the DataFrame

Sometimes there may be extraneous data at the end of the file, so checking the bottom few rows is also important:

In [None]:
meteorites.tail() # display final 5 rows of the DataFrame
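
Here is a sketch of what extraneous data at the end of a file can look like and one way to handle it. The CSV text below is made up, and `skipfooter` is just one option; it requires `engine='python'`.

```python
import io
import pandas as pd

# Made-up CSV text whose last line is a note rather than data
csv_text = "\n".join([
    "name,mass (g)",
    "A,21.0",
    "B,720.0",
    "Source: example export",
])

raw = pd.read_csv(io.StringIO(csv_text))
print(raw.tail())  # the note shows up as a final row with a missing mass

# skipfooter drops lines from the end of the file; it needs the Python parsing engine
clean = pd.read_csv(io.StringIO(csv_text), skipfooter=1, engine='python')
print(clean.tail())
```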

Get some summarized information about the DataFrame

In [None]:
meteorites.info()

#### Extracting subsets

An important part of working with DataFrames is extracting subsets of the data: finding rows that meet a certain set of criteria, isolating columns/rows of interest, etc. After narrowing down our data, we are closer to discovering insights. This section will be the backbone of many analysis tasks.

#### Selecting columns

We can select columns as attributes if their names would be valid Python variables:

In [None]:
meteorites.name

Or we can select columns using dictionary-style string keys. Columns must be selected this way if the column name is not a valid Python variable name. For example, the column 'mass (g)' must be selected with a string key:

In [None]:
meteorites['name']
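
Attribute access has one more pitfall worth knowing: a column whose name matches an existing DataFrame method is hidden by that method. The sketch below uses made-up data, and the column name 'count' is hypothetical, chosen only because `DataFrame.count` exists.

```python
import pandas as pd

# Made-up data; 'count' is a hypothetical column name that matches a DataFrame method
toy = pd.DataFrame({'name': ['A', 'B'], 'mass (g)': [21.0, 720.0], 'count': [3, 5]})

print(toy['mass (g)'])      # brackets work for any column name
print(toy['count'])         # the column
print(callable(toy.count))  # attribute access returns the DataFrame.count method instead
```

For this reason, bracket selection is the safer habit when column names are not known in advance.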

We can also select multiple columns at once using a list of dictionary string keys

In [None]:
meteorites[['name', 'mass (g)']]

#### Selecting rows

We can select rows by position using standard Python slice notation (the end of the slice is excluded):

In [None]:
meteorites[100:104] # select rows 100 - 103

#### Indexing: Selecting rows and columns

We use `iloc[]` to select rows and columns by their position:

In [None]:
meteorites.iloc[100:104, [0, 3, 4, 6]] # select rows 100 - 103 and columns at index 0, 3, 4, and 6

We use `loc[]` to select by name:

In [None]:
meteorites.loc[100:104, 'mass (g)':'year'] # select rows labeled 100 - 104 (loc slices include the end label) and columns 'mass (g)' - 'year'

In [None]:
meteorites.loc[100:104, ['name', 'mass (g)', 'year']] # select rows labeled 100 - 104 (end label included) and columns 'name', 'mass (g)', and 'year'
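
The difference between the two indexers is easy to miss: `iloc[]` slices exclude the end position, while `loc[]` slices include the end label. A minimal sketch on made-up data (the names `toy`, `by_position`, and `by_label` are for illustration only):

```python
import pandas as pd

# Made-up data with the default integer index 0..5
toy = pd.DataFrame({'name': list('ABCDEF'), 'mass (g)': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})

by_position = toy.iloc[1:4]  # positions 1, 2, 3 (end excluded)
by_label = toy.loc[1:4]      # labels 1, 2, 3, 4 (end included)

print(len(by_position), len(by_label))
```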

#### Filtering with Boolean masks

A **Boolean** is a True or False value

A **Boolean mask** is an array-like structure of Boolean values &ndash; it's a way to specify which rows/columns we want to select (`True`) and which we don't (`False`)

Here's an example of a Boolean mask for meteorites weighing more than 50 grams that were found on Earth (i.e., they were not observed falling):

*(Note the syntax. We surround each condition with parentheses, and we use bitwise operators (`&`, `|`, `~`) instead of logical operators (`and`, `or`, `not`))*

In [None]:
mask = (meteorites['mass (g)'] > 50) & (meteorites.fall == 'Found')
print(mask)

We can use this mask to select the subset of meteorites satisfying the condition that they weigh more than 50 grams and were found on Earth

In [None]:
meteorites[mask]
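
To see why the bitwise operators are needed, the sketch below builds a mask on made-up rows and then tries the logical `and` instead. The variable names are chosen for this example.

```python
import pandas as pd

# Made-up rows
toy = pd.DataFrame({'mass (g)': [21.0, 720.0, 30.0], 'fall': ['Fell', 'Found', 'Found']})

heavy = toy['mass (g)'] > 50
found = toy['fall'] == 'Found'

mask = heavy & found  # element-wise AND, one Boolean per row
print(mask.tolist())

try:
    heavy and found   # `and` asks for a single True/False for the whole Series
except ValueError as err:
    print('ValueError:', err)
```

The parentheses matter too: `&` binds more tightly than `>` and `==`, so leaving them out changes which parts of the expression are evaluated first.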

Here is another Boolean mask to select the subset of meteorites weighing more than 1 million grams (1,000 kilograms or roughly 2,205 pounds) that were observed falling:

In [None]:
mask = (meteorites['mass (g)'] > 1e6) & (meteorites.fall == 'Fell')
meteorites[mask]

An alternative to the Boolean masks above is the `query()` method of a DataFrame:

*(Note, in the `query()` method we can use both logical operators and bitwise operators)*

In [None]:
meteorites.query("`mass (g)` > 1e6 and fall == 'Fell'")
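
As a quick check that the two approaches agree, here is a sketch on made-up rows. It also shows that `query()` can refer to a Python variable with the `@` prefix; `min_mass` is a hypothetical name used only here.

```python
import pandas as pd

# Made-up rows
toy = pd.DataFrame({'mass (g)': [2e6, 720.0, 5e6], 'fall': ['Fell', 'Found', 'Found']})

with_mask = toy[(toy['mass (g)'] > 1e6) & (toy['fall'] == 'Fell')]

min_mass = 1e6  # hypothetical threshold held in a Python variable
with_query = toy.query("`mass (g)` > @min_mass and fall == 'Fell'")

print(with_query)
print(with_mask.equals(with_query))
```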

We can combine Boolean masks and `query()` with `loc[]` and `iloc[]`

In [None]:
meteorites[(meteorites['mass (g)'] > 1e6) & (meteorites.fall == 'Fell')].loc[0:500, ['name', 'mass (g)', 'year']] # among the rows labeled 0 through 500, select the meteorites weighing more than 1 million grams that were observed falling and display their name, mass, and year
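
A Boolean mask can also be passed straight to `loc[]` as the row selector, which filters rows and picks columns in a single step. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows
toy = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'mass (g)': [2e6, 720.0, 5e6],
    'fall': ['Fell', 'Found', 'Fell'],
    'year': [1900, 1950, 2000],
})

mask = (toy['mass (g)'] > 1e6) & (toy['fall'] == 'Fell')
result = toy.loc[mask, ['name', 'mass (g)', 'year']]  # rows where mask is True, three named columns
print(result)
```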

#### Calculating summary statistics

Next, we will discuss how we can calculate various statistics of our dataset to gain some valuable insights

How many of the meteorites were found versus observed falling?

*(Note, pass in `normalize=True` to see this result as proportions of the total; multiply by 100 for percentages. Check the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html) for additional functionality.)*

In [None]:
meteorites.fall.value_counts()
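
Here is a small sketch of what `normalize=True` returns, using made-up values rather than the real column:

```python
import pandas as pd

fall = pd.Series(['Found', 'Found', 'Found', 'Fell'])  # made-up values

counts = fall.value_counts()
shares = fall.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares * 100)  # multiply by 100 for percentages
```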

What was the mass of the average meteorite?

*(Note, the mean is being skewed upwards by some very heavy meteorites &ndash; the distribution is [right-skewed or positive-skewed](https://www.analyticsvidhya.com/blog/2020/07/what-is-skewness-statistics/))*

In [None]:
meteorites['mass (g)'].mean()

Taking a look at some quantiles at the extremes of the distribution shows that the mean is between the 95th and 99th percentile of the distribution, so it isn't a good measure of central tendency here

In [None]:
meteorites['mass (g)'].quantile([0.01, 0.05, 0.5, 0.95, 0.99])

A better measure in this case is the median (50th percentile), since it is robust to outliers:

In [None]:
meteorites['mass (g)'].median()
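
The effect of a few extreme values is easy to reproduce on a toy example. The masses below are made up, with one very heavy value among four light ones:

```python
import pandas as pd

# Made-up masses in grams, with one extreme value
masses = pd.Series([10.0, 20.0, 30.0, 40.0, 1_000_000.0])

print(masses.mean())
print(masses.median())
print((masses < masses.mean()).mean())  # share of values below the mean
```

Here the single heavy value pulls the mean far above four of the five values, while the median stays among the typical ones.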

What was the mass of the heaviest meteorite?

In [None]:
meteorites['mass (g)'].max()

Let's extract the information on this meteorite:

*(Note, `idxmax()` is a method that returns the index label of the first occurrence of the maximum entry, which is why we pass it to `loc[]`)*

In [None]:
meteorites.loc[meteorites['mass (g)'].idxmax()]
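
With the default integer index, labels and positions happen to coincide, so the label-versus-position distinction is easier to see with a non-default index. A sketch on made-up rows:

```python
import pandas as pd

# Made-up rows, with letters as index labels
toy = pd.DataFrame({'mass (g)': [21.0, 5e6, 720.0]}, index=['A', 'B', 'C'])

label = toy['mass (g)'].idxmax()     # index label of the largest mass
position = toy['mass (g)'].argmax()  # integer position of the largest mass

print(label, position)
print(toy.loc[label])
```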

How many different types of meteorite classes are represented in this dataset?

*(Note, check out [this Wikipedia article](https://en.wikipedia.org/wiki/Meteorite_classification) for some information on meteorite classes.)*

In [None]:
print(meteorites.recclass.nunique()) # display number of classes

print(meteorites.recclass.unique()[:14]) # display a few examples of the unique classes
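
One detail to keep in mind when comparing these two methods: by default `nunique()` does not count missing values, while `unique()` includes them in its result. The sketch below uses made-up values with one missing entry.

```python
import numpy as np
import pandas as pd

recclass = pd.Series(['L5', 'H6', 'L5', np.nan])  # made-up values with one missing entry

print(recclass.nunique())              # missing values are not counted by default
print(recclass.nunique(dropna=False))  # count the missing value as its own category
print(recclass.unique())               # the missing value appears in the result
```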

We can get common summary statistics for all columns at once. By default (i.e., without passing *include='all'*), this will only cover numeric columns, but here, we will summarize everything together:

*(Note, `NaN` values signify missing data. For instance, the `fall` column contains strings, so there is no value for `mean`; likewise, `mass (g)` is numeric, so we don't have entries for the categorical summary statistics (`unique`, `top`, `freq`).)*

In [None]:
meteorites.describe(include='all')
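
To compare the default behavior with `include='all'`, here is a sketch on a made-up two-column DataFrame:

```python
import pandas as pd

# Made-up rows: one text column and one numeric column
toy = pd.DataFrame({'fall': ['Fell', 'Found', 'Found'], 'mass (g)': [21.0, 720.0, 30.0]})

numeric_only = toy.describe()             # default: numeric columns only
everything = toy.describe(include='all')  # all columns, with NaN where a statistic does not apply

print(numeric_only.columns.tolist())
print(everything.loc[['unique', 'top', 'freq']])
```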

Check out the documentation for more descriptive statistics:

- [DataFrame Stats](https://pandas.pydata.org/docs/reference/frame.html#computations-descriptive-stats)

#### Check out the lesson 1 [workbook](./workbook.ipynb) for practice examples