<img src='https://pandas.pydata.org/docs/_static/pandas.svg' width=500>

https://pandas.pydata.org/

Pandas' data-manipulation capabilities are built on top of NumPy and use its fast array processing; its plotting capabilities are built on top of Matplotlib.

* "pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language."

* It may be one of the most widely used tools for data munging

  * present data in nice formats
  * multiple convenient methods for filtering data
  * work with a variety of data formats (CSV, Excel, …)
  * convenient functions for quickly plotting data

* The name comes from panel data (and is also a play on python data analysis)

Import the library:

In [None]:
import pandas as pd

Import the data:

In [None]:
dinodata = pd.read_csv('https://raw.githubusercontent.com/benjum/UCLAX-24Fall-EDA/main/Data/DatasaurusDozen.csv')

Look at some data values:

In [None]:
dinodata

Making a scatter plot of the data in the `x` and `y` columns is easy:

In [None]:
dinodata.plot.scatter(x = 'x', y = 'y')

# What are we looking at?

The above command looks easy, but there's more to the data than this.

First, let's cover a couple of things about Pandas.

## Basic data structures in Pandas

Python can store values in a variety of data structures: single variables, lists, dictionaries, sets, etc.  

Pandas has two key data structures for storing data:

1. Series
    * 1D
    * Like an array
    * Items are labeled by an index
2. Dataframes
    * 2D
    * Like a spreadsheet
    * Items are labeled by an index (row label) and column name
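
As a minimal sketch of the two structures, the following builds a small Series and a small dataframe by hand. The values and the names `s` and `df` are made up for illustration and are not part of the Datasaurus data.

```python
import pandas as pd

# A Series: 1D values, each labeled by an index entry
s = pd.Series([10.5, 20.1, 30.7], index=['a', 'b', 'c'])

# A dataframe: 2D, labeled by a row index and column names
df = pd.DataFrame({'x': [1.0, 2.0, 3.0],
                   'y': [4.0, 5.0, 6.0]},
                  index=['a', 'b', 'c'])

print(s['b'])             # look up a Series item by its index label
print(df.loc['b', 'y'])   # look up a dataframe item by row label and column name
print(type(df['x']))      # a single column of a dataframe is a Series
```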

`dinodata` is a dataframe:

In [None]:
dinodata

* `head(n)`: show us the first `n` rows (5 by default)
* `tail(n)`: show us the last `n` rows (5 by default)
* `info()` : a range of summary info

In [None]:
dinodata.head()

In [None]:
dinodata.tail()

In [None]:
dinodata.info()

There are several other useful dataframe attributes and methods that will allow you to get summary info:
* `columns` : column names
* `dtypes` : data types of the columns (dataframes can hold different datatypes in different columns)
* `index` : information about the row indices (they don't have to be numerical)
* `shape` : the size of the dataframe in each dimension
* `describe()` : basic statistics about the data columns

In [None]:
dinodata.columns

In [None]:
dinodata.dtypes

In [None]:
dinodata.index

In [None]:
dinodata.shape

In [None]:
dinodata.shape[0]

In [None]:
dinodata.describe()

In [None]:
dinodata.describe(include = 'all')

If you select one of the columns of `dinodata` you'll get a Series in return:

In [None]:
dinodata['dataset']

## Selecting data from a dataframe

If you have a dataframe `df` and want to look at a specific column `columnname`, use `df['columnname']`

In [None]:
dinodata['x']

Dataframes can have both numerical and label-based indices.  Pandas provides specific data-retrieval syntax to accommodate this.

In [None]:
# This will give an error (a KeyError)!
# Plain [] with a single value looks for a column named 0
dinodata[0]

In [None]:
# This will not give an error:
# a slice inside [] selects rows
dinodata[0:1]

It's best to stick with `loc` and `iloc` for the moment to index dataframes.
* `loc` : label-based indexing (which can look numerical if the row index is a number)
* `iloc` : indexing by integer position (0 for the first row or column, 1 for the second, ...)
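
The difference between the two is easiest to see when the row labels do not start at 0. The following is a small sketch on hypothetical toy data (the dataframe `df` and its index values are assumptions made for illustration, not part of `dinodata`):

```python
import pandas as pd

# Hypothetical toy data; the row labels deliberately do not start at 0
df = pd.DataFrame({'x': [1.5, 2.5, 3.5]}, index=[10, 20, 30])

by_label = df.loc[10, 'x']      # the row whose label is 10
by_position = df.iloc[0, 0]     # the row at position 0, column at position 0
print(by_label, by_position)    # both refer to the same cell

# No row is labeled 0 in this index, so loc raises a KeyError
try:
    df.loc[0]
except KeyError:
    print('no row is labeled 0')
```

In `dinodata` the row labels happen to be 0, 1, 2, ..., so `loc` and `iloc` look alike for rows until the data is filtered or reordered.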

In [None]:
dinodata

In [None]:
# the first row;
# return value is a Series

dinodata.loc[0]

In [None]:
# the first row;
# return value is a Dataframe (note the index is [0])

dinodata.loc[[0]]

In [None]:
# the first two rows;
# return value is a Dataframe (the index is a list [0,1])

dinodata.loc[[0,1]]

In [None]:
# indexing both the row and column

dinodata.loc[0,'x']

In [None]:
# indexing both the row and column
# and returning a dataframe

dinodata.loc[[0],['x']]

In [None]:
# you can use lists for the indices

dinodata.loc[[0],['x','y']]

In [None]:
dinodata.loc[[0,10],['x','y']]

`iloc` is useful when you instead want to index by integer position.

In [None]:
dinodata.iloc[0]

Before you execute the below, try to predict whether it will return a Series or a Dataframe.

In [None]:
dinodata.iloc[1]

In [None]:
dinodata.iloc[[1]]

In [None]:
dinodata.iloc[0:1]

In [None]:
dinodata.iloc[0:4]

In [None]:
# this will give an error! you can't use a label-based index like 'x' with iloc

dinodata.iloc[0:4,'x']

In [None]:
# instead of referencing the column with 'x'
# iloc indexes it numerically

dinodata.iloc[0:4, 1]

In [None]:
dinodata.iloc[[0,1,2,3],[1]]

In [None]:
dinodata.iloc[0,0]

In [None]:
dinodata.loc[0,'dataset']

## What's the data for 'dataset' == 'dino'?  Boolean indexing

It is useful to be able to get elements where certain conditions are true.

Here, for example, we may want to get only those rows that are part of the 'dino' dataset.

This can be accomplished with boolean indexing, where the index is a True/False condition, and there is one such value for every row.

The following sets up the boolean series of True/False values for every row.

In [None]:
dinodata['dataset'] == 'dino'

We can use that as the index to `dinodata`; i.e., for any dataframe `df` we can use `df[condition]` to get only those rows where `condition` is True.

In [None]:
dinodata[dinodata['dataset'] == 'dino']

In [None]:
# Check:  What is the condition in the above command?
# Write it here and execute:



A boolean series also works as the index when using `loc`:

In [None]:
dinodata.loc[dinodata['dataset'] == 'dino']

In [None]:
dinodata.loc[dinodata['dataset'] == 'circle']

Note above what happens to the indices: the selected rows keep their original row labels instead of being renumbered from 0.  Keep this behavior in mind if you want to index the returned result.
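
A small sketch of what this means in practice, using a hypothetical four-row stand-in for `dinodata` (the values are made up). The `reset_index(drop=True)` call is not used elsewhere in this notebook; it is shown only as one possible way to renumber the rows.

```python
import pandas as pd

# Hypothetical stand-in for dinodata with two tiny "datasets"
df = pd.DataFrame({'dataset': ['dino', 'dino', 'circle', 'circle'],
                   'x': [1.0, 2.0, 3.0, 4.0]})

circle = df.loc[df['dataset'] == 'circle']
print(circle.index.tolist())   # the original row labels are kept

first_by_position = circle.iloc[0]   # first row of the subset, by position
# circle.loc[0] would raise a KeyError: no row in the subset is labeled 0

# reset_index(drop=True) renumbers the rows from 0, if that is preferred
renumbered = circle.reset_index(drop=True)
print(renumbered.index.tolist())
```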

In [None]:
dinodata.loc[dinodata['dataset'] == 'dino','x']

In [None]:
dinodata['dataset'].str.startswith('d')

In [None]:
dinodata.loc[dinodata['dataset'].str.startswith('d')]

In [None]:
dinodata.loc[dinodata['dataset'].str.contains('in')]

Dataframes have many very useful methods.

... which we will ignore for the moment until next week when we get to exploratory data analysis.

For now:  plotting!

## Plotting

Let's make a scatter plot with only the `dino` dataset

In [None]:
# How do we get that subset of data?

a = dinodata[dinodata['dataset'] == 'dino']

Make a plot:

In [None]:
a.plot(x='x', y='y')

What's with the zig-zags?

By default, pandas will make a line plot connecting the points. Since the points are connected in row order rather than in order of increasing `x`, the connecting line zigzags back and forth in the x and y directions.
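
To see the row-order behavior on something smaller, here is a sketch with three hypothetical points (made-up values). It reads the x values back from the plotted line to show the order in which the points were connected. The `matplotlib.use('Agg')` line is an assumption so the example runs without a display; leave it out in a notebook so the plots still appear inline.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend, not needed in a notebook
import pandas as pd

# Hypothetical points whose x values are not in increasing order
df = pd.DataFrame({'x': [3, 1, 2], 'y': [30, 10, 20]})

# Default line plot: points are connected in row order
ax = df.plot(x='x', y='y')
line_x = ax.get_lines()[0].get_xdata()
print(list(line_x))

# Sorting the rows by x first gives a line that moves left to right
ax_sorted = df.sort_values(by='x').plot(x='x', y='y')
sorted_x = ax_sorted.get_lines()[0].get_xdata()
print(list(sorted_x))
```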

We actually want to plot this as a scatter plot instead of a line plot.

In [None]:
a.plot(x='x', y='y', kind='scatter')

The `kind` parameter makes it very easy to make a variety of different elementary plots:

* `line` : line plot
* `bar` : vertical bar plot
* `barh` : horizontal bar plot
* `hist` : histogram
* `box` : boxplot
* `kde` : kernel density estimation plot
* `density` : same as kde
* `area` : area plot
* `pie` : pie plot
* `scatter` : scatter plot
* `hexbin` : hexbin plot

In [None]:
a.iloc[0:20].plot(x='x', y='y', kind='bar')

Of course, you still have to think beforehand about what you want to plot.

To make things easier, let's look at a subset of data:

In [None]:
b = a[0:15]

In [None]:
b

In [None]:
b.plot(x='x', y='y', kind='bar')

Note that pandas does not sort the x-axis for us here: the bars appear in row order.

In [None]:
b.sort_values(by='x')

In [None]:
b.sort_values(by='x').plot()

In [None]:
b.sort_values(by='x').plot(x='x', y='y', kind='bar')

In [None]:
b.sort_values(by='x').plot(x='x', y='y', kind='barh')

In [None]:
a.plot(x='x',
       y='y',
       kind='scatter')

In [None]:
a.plot(x='x',
       y='y',
       kind='scatter',
       figsize=(5,5))

In [None]:
a.plot(x='x',
       y='y',
       kind='scatter',
       figsize=(5,5),
       xlabel='hdata',
       ylabel='vdata')

In [None]:
a.plot(x='x',
       y='y',
       kind='scatter',
       figsize=(5,5),
       xlabel='hdata',
       ylabel='vdata',
       color='black')

In [None]:
dinodata[dinodata['dataset']=='dino'].plot(x='x',
       y='y',
       kind='scatter',
       figsize=(5,5),
       xlabel='hdata',
       ylabel='vdata',
       color='black')

In [None]:
dinodata[dinodata['dataset']=='dino'].plot(x='x',
       y='y',
       kind='scatter',
       figsize=(5,5),
       xlabel='hdata',
       ylabel='vdata',
       color='black')
dinodata[dinodata['dataset']=='circle'].plot(x='x',
       y='y',
       kind='scatter',
       figsize=(5,5),
       xlabel='hdata',
       ylabel='vdata',
       color='black')

In [None]:
ax = dinodata[dinodata['dataset']=='dino'].plot(x='x',
       y='y',
       kind='scatter',
       figsize=(5,5),
       xlabel='hdata',
       ylabel='vdata',
       color='black')
dinodata[dinodata['dataset']=='circle'].plot(x='x',
       y='y',
       kind='scatter',
       figsize=(5,5),
       xlabel='hdata',
       ylabel='vdata',
       color='black',
       ax=ax)

# End