# Learn Pandas
> "Learning pandas"
- toc: false
- image: /images/covidblue.png
- author: "Dwight Gunning"
- branch: master
- badges: true
- comments: true

## Getting Started

Pandas is a Python library that you install using **pip**

`
pip install pandas
`

or using **conda**

`
conda install pandas
`
Most users should install the latest version. For larger data projects it is often useful to pin to a specific pandas version to ensure API stability.
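
If you do pin a version (with pip, a pin is written as `pandas==<version>`), it helps to confirm which version is actually installed. A minimal check, assuming pandas is already installed in the active environment; the variable name `installed_version` is only for illustration:

```python
import pandas as pd

# The installed version is exposed as a plain string
installed_version = pd.__version__
print(installed_version)
```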

In [1]:
import pandas as pd

## Key Ideas
There are two primary objects - **DataFrame** and **Series**. Most commonly, you will work with dataframes, which are similar to Excel worksheets or database tables. You can think of a DataFrame as a rectangular data structure made up of a number of columns. Each of those columns is a **Series**.

Every dataframe or series has an **Index** - a special column of labels that identifies each row. If you create a dataframe without specifying an Index, pandas will create a numeric index of `range(0, len(df))`. Series in the same dataframe share the same index.
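
The default index and the shared index can both be checked directly. This is a small sketch using made-up data (`df_demo` and its columns are hypothetical, not part of the cat example that follows):

```python
import pandas as pd

# Hypothetical example data, used only to illustrate the default index
df_demo = pd.DataFrame({'x': [10, 20, 30], 'y': ['a', 'b', 'c']})

# No index was specified, so pandas builds a RangeIndex
print(type(df_demo.index).__name__)   # RangeIndex
print(list(df_demo.index))            # [0, 1, 2]

# Each column is a Series that carries the same index as its dataframe
print(df_demo['x'].index.equals(df_demo.index))   # True
```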

Let's create three lists that contain information about cats:

In [41]:
names = ['Lucy', 'Bella','Lucy','Nala']
ages = [2,2,3,5]
colors = ['gray', 'tabby', 'white', 'black']

Now we create a **dataframe** where each of the lists becomes a *column*, or in pandas terminology - a **series**.

In [43]:
df = pd.DataFrame({'name': names, 'age': ages, 'color':colors})
df

|   | name | age | color |
|---|------|-----|-------|
| 0 | Lucy | 2 | gray |
| 1 | Bella | 2 | tabby |
| 2 | Lucy | 3 | white |
| 3 | Nala | 5 | black |


#### Values
The three data columns (name, age and color) are combined into a two-dimensional array that is available as `df.values`.

In [48]:
df.values

array([['Lucy', 2, 'gray'],
       ['Bella', 2, 'tabby'],
       ['Lucy', 3, 'white'],
       ['Nala', 5, 'black']], dtype=object)

This is a NumPy array with 4 rows and 3 columns.

In [50]:
type(df.values)

numpy.ndarray

You can see the shape of the dataframe by reading the `df.shape` attribute. It matches the shape of the underlying NumPy array. More generally, pandas is built on NumPy, and many dataframe operations are carried out on NumPy arrays under the hood.

In [51]:
df.shape, df.values.shape

((4, 3), (4, 3))
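
Because the cat columns mix strings and integers, the combined array needs a dtype general enough to hold both, which is why the output above shows `dtype=object`. The sketch below rebuilds the same dataframe so it runs on its own and compares the full array with a numeric-only selection. It uses `to_numpy()`, which the pandas documentation recommends over `.values`; the names `mixed` and `ages` are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['Lucy', 'Bella', 'Lucy', 'Nala'],
                   'age': [2, 2, 3, 5],
                   'color': ['gray', 'tabby', 'white', 'black']})

mixed = df.to_numpy()           # strings and integers in one array
ages = df[['age']].to_numpy()   # only the integer column

print(mixed.shape, mixed.dtype)
print(ages.shape, ages.dtype)

# The numeric-only array keeps an integer dtype
print(np.issubdtype(ages.dtype, np.integer))   # True
```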

## Columns
You can see the list of columns in the dataframe by reading `df.columns`. The result for our dataframe is the set of names we specified as the keys of the dictionary we passed in when we created the dataframe.

In [26]:
df.columns

Index(['name', 'age', 'color'], dtype='object')

### Selecting columns

A column can be accessed by passing the column name into the bracket operator e.g. using `df[col]`. So to get the **color** column, use `df['color']`

In [54]:
df['color']

0     gray
1    tabby
2    white
3    black
Name: color, dtype: object

And to get the **age** column use `df['age']`. Notice that both the **color** and **age** columns include the index values `[0,1,2,3]` - because all columns in a dataframe share the same index.

In [55]:
df['age']

0    2
1    2
2    3
3    5
Name: age, dtype: int64

When you select a single column from a dataframe you get a **series**. If you want to get a new dataframe with just one column, use a column list with one item, e.g. `df[['age']]` instead of `df['age']`.

In [60]:
df[['age']]

|   | age |
|---|-----|
| 0 | 2 |
| 1 | 2 |
| 2 | 3 |
| 3 | 5 |
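
The difference between the two forms shows up in the type and shape of the result. A short sketch, rebuilding the cat dataframe so it runs on its own (the names `as_series` and `as_frame` are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Lucy', 'Bella', 'Lucy', 'Nala'],
                   'age': [2, 2, 3, 5],
                   'color': ['gray', 'tabby', 'white', 'black']})

as_series = df['age']     # single brackets give a Series
as_frame = df[['age']]    # a one-item list gives a DataFrame

print(type(as_series).__name__, as_series.shape)   # Series (4,)
print(type(as_frame).__name__, as_frame.shape)     # DataFrame (4, 1)
```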


### Selecting columns - as a property
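When a dataframe is created, its columns are exposed as attributes of that dataframe object, so it's possible to use the dot `.` operator. This only works when the column name is a valid Python identifier and does not clash with an existing dataframe attribute or method.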

In [61]:
df.age

0    2
1    2
2    3
3    5
Name: age, dtype: int64
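
Dot access is convenient but has limits. The sketch below uses a hypothetical dataframe (`pets`, with columns chosen to trigger the problem cases) to show where the bracket operator is still needed:

```python
import pandas as pd

# Hypothetical columns: one plain name, one that clashes with a
# DataFrame method, and one that contains a space
pets = pd.DataFrame({'age': [2, 5],
                     'count': [1, 3],
                     'fur color': ['gray', 'black']})

print(pets.age.tolist())            # [2, 5]

# pets.count is the DataFrame.count method, not the column
print(callable(pets.count))         # True
print(pets['count'].tolist())       # [1, 3]

# A name with a space cannot be written after a dot at all
print(pets['fur color'].tolist())   # ['gray', 'black']
```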

### Selecting multiple columns
To select multiple columns pass in a list with the names of the columns. 

In [57]:
df[['age', 'color']]

|   | age | color |
|---|-----|-------|
| 0 | 2 | gray |
| 1 | 2 | tabby |
| 2 | 3 | white |
| 3 | 5 | black |


Notice the double brackets `[[]]`? The outer brackets are used for selections and the inner brackets are the list of columns. The code above is equivalent to this

In [58]:
age_and_color = ['age', 'color']
df[age_and_color]

|   | age | color |
|---|-----|-------|
| 0 | 2 | gray |
| 1 | 2 | tabby |
| 2 | 3 | white |
| 3 | 5 | black |


When you select multiple columns from a dataframe, you get a new dataframe with just those columns.
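
The new dataframe takes its column order from the list you pass, and the original is left as it was. A brief sketch, rebuilding the cat dataframe so it runs on its own (`subset` is just an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Lucy', 'Bella', 'Lucy', 'Nala'],
                   'age': [2, 2, 3, 5],
                   'color': ['gray', 'tabby', 'white', 'black']})

# Columns come back in the order given in the list
subset = df[['color', 'age']]

print(list(subset.columns))   # ['color', 'age']
print(list(df.columns))       # ['name', 'age', 'color']
print(subset is df)           # False
```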

### Adding columns
To add a new column use the same bracket access operator `[]` as for selecting a column, and assign the new values to it.

In [67]:
df['weight'] = [1.0,0.5,2.0,2.0]
df

|   | name | age | color | weight |
|---|------|-----|-------|--------|
| 0 | Lucy | 2 | gray | 1.0 |
| 1 | Bella | 2 | tabby | 0.5 |
| 2 | Lucy | 3 | white | 2.0 |
| 3 | Nala | 5 | black | 2.0 |
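
The assigned value does not have to be a list. As a sketch, the block below also assigns a single scalar and a column computed from an existing one; the `species` and `age_months` columns are hypothetical additions for illustration only:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Lucy', 'Bella', 'Lucy', 'Nala'],
                   'age': [2, 2, 3, 5],
                   'color': ['gray', 'tabby', 'white', 'black']})

# A list needs one value per row
df['weight'] = [1.0, 0.5, 2.0, 2.0]

# A scalar is repeated for every row
df['species'] = 'cat'

# A column can be computed from another column
df['age_months'] = df['age'] * 12

print(df['age_months'].tolist())   # [24, 24, 36, 60]
```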


### Dropping columns
To drop columns, pass a list of the column names to `df.drop` using the `columns` argument. The result is a new dataframe without those columns.

In [65]:
df.drop(columns=['weight'])

|   | name | age | color |
|---|------|-----|-------|
| 0 | Lucy | 2 | gray |
| 1 | Bella | 2 | tabby |
| 2 | Lucy | 3 | white |
| 3 | Nala | 5 | black |
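
Note that `drop` leaves the original dataframe untouched unless you keep the result. A minimal sketch, rebuilding the dataframe with the `weight` column so it runs on its own (`without_weight` is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Lucy', 'Bella', 'Lucy', 'Nala'],
                   'age': [2, 2, 3, 5],
                   'color': ['gray', 'tabby', 'white', 'black'],
                   'weight': [1.0, 0.5, 2.0, 2.0]})

without_weight = df.drop(columns=['weight'])

print(list(without_weight.columns))   # ['name', 'age', 'color']
print(list(df.columns))               # ['name', 'age', 'color', 'weight']

# Reassign if you want the change to stick
df = df.drop(columns=['weight'])
print(list(df.columns))               # ['name', 'age', 'color']
```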


## Indexes
An absolute beginner to pandas may not need to know about indexes, especially since pandas creates indexes automatically for you. However, pandas relies on indexes for its internal operations, and as your use of pandas increases you will eventually bump into index issues.

#### Index
Since we did not specify an index, pandas automatically created one for us. The default index is a `RangeIndex` that runs from `0` up to, but not including, `len(df)`.

In [24]:
df.index

RangeIndex(start=0, stop=4, step=1)

#### set_index
You can change the column used for the index using `df.set_index`, passing in the column(s) to use. Here we tell the dataframe to promote the column **name** to be the index.

In [32]:
df.set_index('name')

| name | age | color | weight |
|------|-----|-------|--------|
| Lucy | 2 | gray | 1.0 |
| Bella | 2 | tabby | 0.5 |
| Lucy | 3 | white | 2.0 |
| Nala | 5 | black | 2.0 |


You can set multiple columns as the index. Here we tell the dataframe to promote **name**, and **age** to be used as the index.

In [39]:
df2 = df.set_index(['name', 'age'])
df2

| name | age | color | weight |
|------|-----|-------|--------|
| Lucy | 2 | gray | 1.0 |
| Bella | 2 | tabby | 0.5 |
| Lucy | 3 | white | 2.0 |
| Nala | 5 | black | 2.0 |
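
With two columns in the index, rows can be looked up by one level or by both. The sketch below rebuilds the cat dataframe from the three original lists (so there is no `weight` column) and uses `loc` for the lookups; the variable names `lucy_rows` and `lucy_aged_3` are only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Lucy', 'Bella', 'Lucy', 'Nala'],
                   'age': [2, 2, 3, 5],
                   'color': ['gray', 'tabby', 'white', 'black']})

df2 = df.set_index(['name', 'age'])
print(list(df2.index.names))   # ['name', 'age']

# Look up by the first level only: every cat called Lucy
lucy_rows = df2.loc['Lucy']
print(lucy_rows['color'].tolist())   # ['gray', 'white']

# Look up by both levels: the Lucy who is 3 years old
lucy_aged_3 = df2.loc[('Lucy', 3), 'color']
print(lucy_aged_3)   # white
```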


#### reset_index
The opposite of `set_index`, which promotes a column to the index, is `reset_index`, which demotes the index back to regular columns. Pass `drop=True` if you want to discard the index instead of keeping it as a column:

`df.reset_index(drop=True)`

In [37]:
df2 = df.set_index(['name', 'age'])
df2

| name | age | color | weight |
|------|-----|-------|--------|
| Lucy | 2 | gray | 1.0 |
| Bella | 2 | tabby | 0.5 |
| Lucy | 3 | white | 2.0 |
| Nala | 5 | black | 2.0 |
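
Starting from a dataframe indexed by **name** and **age**, the two forms of `reset_index` behave differently. A minimal sketch, rebuilding the cat dataframe from the three original lists so it runs on its own (`restored` and `discarded` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Lucy', 'Bella', 'Lucy', 'Nala'],
                   'age': [2, 2, 3, 5],
                   'color': ['gray', 'tabby', 'white', 'black']})
df2 = df.set_index(['name', 'age'])

# Default: the index levels become regular columns again
restored = df2.reset_index()
print(list(restored.columns))    # ['name', 'age', 'color']

# drop=True: the index levels are thrown away
discarded = df2.reset_index(drop=True)
print(list(discarded.columns))   # ['color']

# Either way the result gets a fresh default index
print(list(restored.index))      # [0, 1, 2, 3]
```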


## Note: Pandas operations return copies
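Most pandas operations, like `set_index`, return a new dataframe rather than modifying the original. To keep the result, assign it back with `df = df.set_index('<col>')`. If you want to set the index on the original object instead, use `df.set_index('<col>', inplace=True)`.

It is worth knowing when pandas returns a copy and how to work with those copies.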
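
The difference is easy to see by checking the index name before and after. A small sketch, rebuilding the cat dataframe so it runs on its own (`by_name` and `result` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Lucy', 'Bella', 'Lucy', 'Nala'],
                   'age': [2, 2, 3, 5],
                   'color': ['gray', 'tabby', 'white', 'black']})

# Without inplace, a new dataframe is returned and df is unchanged
by_name = df.set_index('name')
print(df.index.name)        # None
print(by_name.index.name)   # name

# With inplace=True, df itself is modified and nothing is returned
result = df.set_index('name', inplace=True)
print(result)               # None
print(df.index.name)        # name
```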


## Cheatsheets
#### Pandas Cheat Sheet - Dataquest

#### Pandas for Data Science - Datacamp

#### Data Wrangling in Pandas

## Reading Data

**pandas** can read rectangular data from many different sources and file formats. Here is a list - or rather, a dataframe containing all the pandas `read_*` functions.

In [11]:
pd.DataFrame(data=[f for f in dir(pd) if f.startswith('read_')],
             columns=['function'])

|   | function |
|---|----------|
| 0 | read_clipboard |
| 1 | read_csv |
| 2 | read_excel |
| 3 | read_feather |
| 4 | read_fwf |
| 5 | read_gbq |
| 6 | read_hdf |
| 7 | read_html |
| 8 | read_json |
| 9 | read_orc |


### Reading CSVs

Most likely, you will be using `read_csv`, which reads CSV data from a file into a dataframe.

In [34]:
# data = pd.read_csv('')
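
To try `read_csv` without a file on disk, you can pass it a file-like object instead of a path. This is a sketch under that assumption: the CSV text and the name `cats` are made up for illustration, and `io.StringIO` stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """name,age,color
Lucy,2,gray
Bella,2,tabby
"""

# read_csv accepts a file path or any file-like object
cats = pd.read_csv(io.StringIO(csv_text))

print(cats.shape)              # (2, 3)
print(list(cats.columns))      # ['name', 'age', 'color']
print(cats['age'].tolist())    # [2, 2]
```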