# Pandas

[Pandas](https://pandas.pydata.org/) is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It was developed mainly for working with relational or labeled data and provides two data structures, Series and DataFrame, for manipulating data.

A pandas Series is a one-dimensional labeled array that can store data of any type; its axis labels are collectively called the index. A pandas DataFrame is a mutable, two-dimensional table (rows and columns) whose columns can hold different data types.
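
As a minimal sketch of the two structures described above, the snippet below builds a small Series and a small DataFrame. The names `s` and `frame` and all the values are made up for illustration.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s["b"])  # look up a value by its label

# A DataFrame: a table whose columns may hold different data types.
frame = pd.DataFrame({"name": ["x", "y"], "value": [1.5, 2.5]})
print(frame.dtypes)

# Selecting one column of a DataFrame gives back a Series.
col = frame["value"]
print(type(col))
```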

Use `import` to load pandas using `pd` as an alias.

In [1]:
import pandas as pd

Create three lists and a dictionary `d`.

In [2]:
col1 = list(range(1, 6))
col2 = list(range(7, 12))
col3 = list(range(13, 18))

d = {
    'col1': col1,
    'col2': col2,
    'col3': col3
}

print(type(d))

<class 'dict'>


Create a data frame `df`.

In [3]:
df = pd.DataFrame(d)
print(type(df))
print(df)

<class 'pandas.core.frame.DataFrame'>
   col1  col2  col3
0     1     7    13
1     2     8    14
2     3     9    15
3     4    10    16
4     5    11    17


Name the rows. (Row labels are stored in the index, `df.index`, and column labels in `df.columns`.)

In [4]:
df.index = ['one', 'two', 'three', 'four', 'five']
print(df)

       col1  col2  col3
one       1     7    13
two       2     8    14
three     3     9    15
four      4    10    16
five      5    11    17
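
To make the index/columns terminology concrete, here is a small sketch that inspects both attributes and relabels them with `rename`. The two-row frame and the new labels `first` and `second` are hypothetical, chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [7, 8]})
df.index = ["one", "two"]

print(df.index.tolist())    # row labels
print(df.columns.tolist())  # column labels

# rename returns a new DataFrame and leaves df itself unchanged.
renamed = df.rename(index={"one": "first"}, columns={"col2": "second"})
print(renamed)
```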


Read a CSV file into a DataFrame and print the first three rows.

In [5]:
iris = pd.read_csv("../data/iris.csv")
print(iris[0:3])

   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
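
If the CSV file is not at hand, the same steps can be tried on in-memory text. The sketch below assumes a cut-down, three-column version of the data (values copied from the rows shown in this section) and uses `head(3)`, which selects the same rows as the `[0:3]` slice.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file: a few rows held in a string.
csv_text = """Sepal.Length,Sepal.Width,Species
5.1,3.5,setosa
4.9,3.0,setosa
4.7,3.2,setosa
4.6,3.1,setosa
"""

# read_csv accepts a file-like object as well as a path.
sample = pd.read_csv(io.StringIO(csv_text))

first_three = sample.head(3)  # same rows as sample[0:3]
print(first_three)
```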


Use [loc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) to access a group of rows and columns using labels or a boolean array. The example below uses row labels (here, the default integer index) to access specific rows.

In [6]:
print(iris.loc[[1, 3, 5]])

   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
1           4.9          3.0           1.4          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
5           5.4          3.9           1.7          0.4  setosa
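
Because the row labels of `iris` are integers, it is easy to mistake them for positions. The sketch below uses a small hypothetical frame with string labels to show three kinds of selector that `loc` accepts: a list of labels, a label slice, and a boolean list.

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [1, 2, 3], "col2": [7, 8, 9]},
    index=["one", "two", "three"],
)

by_list = df.loc[["one", "three"]]      # list of row labels
by_slice = df.loc["one":"two"]          # label slice, both ends included
by_mask = df.loc[[True, False, True]]   # boolean list, one entry per row

print(by_list)
print(by_slice)
print(by_mask)
```

Note that a label slice includes its end label, unlike an ordinary Python slice.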


Access specific rows and columns.

In [7]:
print(
    iris.loc[
        range(0,3),
        ['Sepal.Width', 'Sepal.Length']
    ]
)

   Sepal.Width  Sepal.Length
0          3.5           5.1
1          3.0           4.9
2          3.2           4.7


Use `iloc` to access rows and columns by integer position(s).

In [8]:
print(
    iris.iloc[
        [0, 1, 2],
        [1, 0]
    ]
)

   Sepal.Width  Sepal.Length
0          3.5           5.1
1          3.0           4.9
2          3.2           4.7
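
With the default index, labels and positions coincide, so `loc` and `iloc` appear interchangeable. The following sketch uses a hypothetical frame whose index is `[10, 20, 30]` to show where they differ.

```python
import pandas as pd

small = pd.DataFrame({"value": ["a", "b", "c"]}, index=[10, 20, 30])

by_label = small.loc[10:20]    # labels 10 through 20, end label included
by_position = small.iloc[0:1]  # position 0 only, end position excluded
first_cell = small.iloc[0, 0]  # row position 0, column position 0

# There is no row labelled 0, so a label lookup for 0 fails.
try:
    small.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(by_label)
print(by_position)
print(first_cell, label_zero_found)
```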


Note the difference between single square brackets, which return a pandas Series, and double square brackets, which return a pandas DataFrame, when accessing columns. A boolean Series is the usual way to subset the rows of a DataFrame.

In [9]:
print(type(iris["Sepal.Length"]))
print(iris["Sepal.Length"])

<class 'pandas.core.series.Series'>
0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: Sepal.Length, Length: 150, dtype: float64


Double square brackets return a DataFrame.

In [10]:
print(type(iris[["Sepal.Length"]]))
print(iris[["Sepal.Length"]])

<class 'pandas.core.frame.DataFrame'>
     Sepal.Length
0             5.1
1             4.9
2             4.7
3             4.6
4             5.0
..            ...
145           6.7
146           6.3
147           6.5
148           6.2
149           5.9

[150 rows x 1 columns]


Subset rows using a boolean Series.

In [11]:
long_sepal = iris["Sepal.Length"] > 7.5
print(type(long_sepal))
print(iris[long_sepal])

<class 'pandas.core.series.Series'>
     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
105           7.6          3.0           6.6          2.1  virginica
117           7.7          3.8           6.7          2.2  virginica
118           7.7          2.6           6.9          2.3  virginica
122           7.7          2.8           6.7          2.0  virginica
131           7.9          3.8           6.4          2.0  virginica
135           7.7          3.0           6.1          2.3  virginica


Since pandas is built on NumPy, NumPy functions such as `np.logical_and` can combine conditions into a boolean mask for subsetting.

In [12]:
import numpy as np

wanted = np.logical_and(
    iris["Sepal.Length"] > 7.5,
    iris["Petal.Length"] > 6.8
)

print(iris[wanted])

     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
118           7.7          2.6           6.9          2.3  virginica
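
The same kind of mask can also be written with pandas' own `&` operator. This is a minimal sketch on a three-row hypothetical frame (values taken from rows shown in this section); each comparison needs its own parentheses because `&` binds more tightly than `>`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Sepal.Length": [7.6, 7.7, 5.1],
    "Petal.Length": [6.6, 6.9, 1.4],
})

# Element-wise AND of two boolean Series.
mask = (df["Sepal.Length"] > 7.5) & (df["Petal.Length"] > 6.8)

# The NumPy spelling of the same condition, for comparison.
same = np.logical_and(df["Sepal.Length"] > 7.5, df["Petal.Length"] > 6.8)

result = df[mask]
print(result)
```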


Use [apply](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html) to apply a function along an axis of a DataFrame; a Series has an analogous `apply` method that calls a function on each value. The example below applies Python's built-in `str.upper` (convert a string to uppercase; pandas also provides the vectorized [str.upper](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.upper.html)) to the `Species` Series and creates a new column called `SPECIES`.

In [13]:
iris["SPECIES"] = iris["Species"].apply(str.upper)
iris.loc[[0, 1, 2], ["Species", "SPECIES"]]

  Species SPECIES
0  setosa  SETOSA
1  setosa  SETOSA
2  setosa  SETOSA
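
For comparison, here is a sketch of the same conversion done two ways: with `apply` and the built-in `str.upper`, and with the vectorized `.str.upper()` accessor. The short `species` Series is a hypothetical stand-in for the `Species` column.

```python
import pandas as pd

species = pd.Series(["setosa", "virginica"], name="Species")

via_apply = species.apply(str.upper)  # calls str.upper on each value
via_str = species.str.upper()         # vectorized string method

print(via_apply.tolist())
print(via_str.tolist())
```

Both give the same values here. The `.str` accessor is often the more convenient choice for string columns, since it passes missing values through, whereas calling `str.upper` on a non-string value raises a `TypeError`.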
