# Getting Started with pandas

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

## 1. Introduction to pandas Data Structures

## 1.1 Series

A Series is a one-dimensional array-like object containing a sequence of values of the same type (the types are similar to NumPy types) and an associated array of data labels, called its index.

In [2]:
obj = pd.Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

Since we did not specify an index for the data, a default one consisting of the integers 0 through N - 1 (where N is the length of the data) is created.

In [3]:
obj.array

<PandasArray>
[4, 7, -5, 3]
Length: 4, dtype: int64

In [4]:
obj.index

RangeIndex(start=0, stop=4, step=1)

In [5]:
obj = pd.Series([4, -7, 5, 3], index=["a", "b", "c", "d"])
obj

a    4
b   -7
c    5
d    3
dtype: int64

In [6]:
obj["a"]

4

In [7]:
obj[["a", "c"]]

a    4
c    5
dtype: int64

In [8]:
obj[obj > 0]

a    4
c    5
d    3
dtype: int64

In [9]:
obj * 2

a     8
b   -14
c    10
d     6
dtype: int64

In [10]:
np.exp(obj)

a     54.598150
b      0.000912
c    148.413159
d     20.085537
dtype: float64

In [11]:
"b" in obj

True

In [12]:
"e" in obj

False

In [13]:
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj2 = Series(sdata)
obj2

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [14]:
obj2.to_dict()

{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

When you pass only a dictionary, the index of the resulting Series follows the order of the dictionary's keys, which is their insertion order. You can override this by passing an index with the dictionary keys in the order you want them to appear in the resulting Series:

In [18]:
states = ["Ohio", "Texas", "Oregon", "Utah"]
sorted_states = sorted(states)

In [19]:
obj3 = Series(sdata, index=sorted_states)
obj3

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [20]:
obj4 = Series(sdata, index=["California", "Ohio", "Texas", "Oregon"])
obj4

California        NaN
Ohio          35000.0
Texas         71000.0
Oregon        16000.0
dtype: float64

In [24]:
pd.isna(obj4)  # True where a value is missing (NaN)

California     True
Ohio          False
Texas         False
Oregon        False
dtype: bool

In [25]:
pd.notna(obj4)  # True where a value is present (not NaN)

California    False
Ohio           True
Texas          True
Oregon         True
dtype: bool

In [27]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
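
The sum above is NaN wherever a label is missing from either operand, because arithmetic aligns on index labels. As a minimal sketch (the names `a`, `b`, `plain`, and `filled` are illustrative, not from the original), `Series.add` with `fill_value` treats a value missing on only one side as 0:

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
a = pd.Series(sdata, index=["Ohio", "Oregon", "Texas", "Utah"])
b = pd.Series(sdata, index=["California", "Ohio", "Texas", "Oregon"])

# Plain addition: NaN wherever a label is absent or NaN in either Series
plain = a + b

# add() with fill_value=0 substitutes 0 for a value missing on one side only;
# a label that is missing (or NaN) on both sides still yields NaN
filled = a.add(b, fill_value=0)
print(filled)
```

Here "Utah" keeps its value from `a`, while "California" stays NaN because neither Series has data for it.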

In [28]:
obj4.name = "population"
obj4.index.name = "state"
obj4

state
California        NaN
Ohio          35000.0
Texas         71000.0
Oregon        16000.0
Name: population, dtype: float64

In [29]:
obj

a    4
b   -7
c    5
d    3
dtype: int64

In [30]:
obj.index = list(range(4))
obj

0    4
1   -7
2    5
3    3
dtype: int64

## 1.2 DataFrame

A DataFrame represents a rectangular table of data and contains an ordered, named collection of columns, each of which can be a different value type (numeric, string, Boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dictionary of Series all sharing the same index.

In [31]:
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


In [32]:
frame.head() # Selects the first 5 rows

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9


In [33]:
frame.tail() # Selects the last 5 rows

    state  year  pop
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


In [36]:
pd.DataFrame(data, columns=["year", "pop", "state"])

   year  pop   state
0  2000  1.5    Ohio
1  2001  1.7    Ohio
2  2002  3.6    Ohio
3  2001  2.4  Nevada
4  2002  2.9  Nevada
5  2003  3.2  Nevada


In [45]:
frame2 = pd.DataFrame(data, columns=["state", "year", "pop", "debt"])
frame2

    state  year  pop debt
0    Ohio  2000  1.5  NaN
1    Ohio  2001  1.7  NaN
2    Ohio  2002  3.6  NaN
3  Nevada  2001  2.4  NaN
4  Nevada  2002  2.9  NaN
5  Nevada  2003  3.2  NaN


In [38]:
frame["year"]  # Retrieve a column

0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64

Note<br>
Attribute-like access (e.g., frame2.year) and tab completion of column names in IPython are provided as a convenience.

frame2[column] works for any column name, but frame2.column works only when the column name is a valid Python variable name and does not conflict with any of the method names in DataFrame. For example, if a column's name contains whitespace or symbols other than underscores, it cannot be accessed with the dot attribute method.
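
The difference between the two access styles can be seen in a small sketch. The frame `df` and its column names are hypothetical, chosen so that one name contains a space and another collides with the `DataFrame.count` method:

```python
import pandas as pd

# Hypothetical frame: one column name has a space, one shadows a DataFrame method
df = pd.DataFrame({"year": [2000, 2001], "total pop": [1.5, 1.7], "count": [3, 4]})

print(df.year.tolist())          # valid identifier, no conflict: dot access works
print(df["total pop"].tolist())  # whitespace in the name: use brackets
print(callable(df.count))        # df.count is the DataFrame.count method, not the column
print(df["count"].tolist())      # brackets always return the column
```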

In [41]:
frame.loc[0]  # Retrieve a row by label

state    Ohio
year     2000
pop       1.5
Name: 0, dtype: object

In [43]:
frame.iloc[0]

state    Ohio
year     2000
pop       1.5
Name: 0, dtype: object

In [46]:
frame2["debt"] = 16.5
frame2

    state  year  pop  debt
0    Ohio  2000  1.5  16.5
1    Ohio  2001  1.7  16.5
2    Ohio  2002  3.6  16.5
3  Nevada  2001  2.4  16.5
4  Nevada  2002  2.9  16.5
5  Nevada  2003  3.2  16.5


In [47]:
frame2["debt"] = np.arange(6.)
frame2

    state  year  pop  debt
0    Ohio  2000  1.5   0.0
1    Ohio  2001  1.7   1.0
2    Ohio  2002  3.6   2.0
3  Nevada  2001  2.4   3.0
4  Nevada  2002  2.9   4.0
5  Nevada  2003  3.2   5.0


When you are assigning lists or arrays to a column, the value’s length must match the length of the DataFrame. If you assign a Series, its labels will be realigned exactly to the DataFrame’s index, inserting missing values in any index values not present:

In [48]:
val = pd.Series([-1.2, -1.5, -1.7], index=[2, 4, 5])

frame2["debt"] = val

frame2

    state  year  pop  debt
0    Ohio  2000  1.5   NaN
1    Ohio  2001  1.7   NaN
2    Ohio  2002  3.6  -1.2
3  Nevada  2001  2.4   NaN
4  Nevada  2002  2.9  -1.5
5  Nevada  2003  3.2  -1.7
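
The length rule for lists and the label alignment for Series can be checked side by side. This is a minimal sketch on a made-up three-row frame `df`; the values are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({"state": ["Ohio", "Ohio", "Nevada"], "year": [2000, 2001, 2001]})

# A list whose length differs from the number of rows is rejected
try:
    df["debt"] = [1.0, 2.0]
except ValueError as err:
    print("ValueError:", err)

# A Series is aligned by label instead: rows without a matching label get NaN,
# and labels that do not exist in the frame (here 7) are dropped
df["debt"] = pd.Series([-1.2, -9.9], index=[1, 7])
print(df)
```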


In [51]:
frame2["eastern"] = frame2["state"] == "Ohio"
frame2

    state  year  pop  debt  eastern
0    Ohio  2000  1.5   NaN     True
1    Ohio  2001  1.7   NaN     True
2    Ohio  2002  3.6  -1.2     True
3  Nevada  2001  2.4   NaN    False
4  Nevada  2002  2.9  -1.5    False
5  Nevada  2003  3.2  -1.7    False


Caution:<br>
New columns cannot be created with the frame2.eastern dot attribute notation.
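
A short sketch of what happens instead, using a hypothetical two-row frame `df`: the dot assignment stores an ordinary Python attribute on the object, and pandas typically emits a `UserWarning` about it (silenced below so the output stays clean).

```python
import warnings
import pandas as pd

df = pd.DataFrame({"state": ["Ohio", "Nevada"]})

# Assigning through a new attribute name sets a plain Python attribute,
# not a column; the warning pandas raises here is suppressed for the demo
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df.eastern = df["state"] == "Ohio"

created_by_dot = "eastern" in df.columns
print(created_by_dot)  # the attribute did not become a column

# Bracket assignment is what actually creates the column
df["eastern"] = df["state"] == "Ohio"
print("eastern" in df.columns)
```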

In [52]:
del frame2["eastern"]
frame2

    state  year  pop  debt
0    Ohio  2000  1.5   NaN
1    Ohio  2001  1.7   NaN
2    Ohio  2002  3.6  -1.2
3  Nevada  2001  2.4   NaN
4  Nevada  2002  2.9  -1.5
5  Nevada  2003  3.2  -1.7


In [53]:
frame2.columns

Index(['state', 'year', 'pop', 'debt'], dtype='object')

Caution:<br>
In pandas versions that do not use Copy-on-Write, the column returned from indexing a DataFrame is a view on the underlying data, not a copy. There, any in-place modifications to the Series will be reflected in the DataFrame; with Copy-on-Write enabled they are not. In either case, the column can be explicitly copied with the Series’s copy method.
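
An explicit copy makes the intent unambiguous. In this minimal sketch (the frame `df` and the name `pop_copy` are illustrative), edits to the copied Series never reach the DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"year": [2000, 2001, 2002], "pop": [1.5, 1.7, 3.6]})

# copy() returns an independent Series, so edits to it never reach the frame
pop_copy = df["pop"].copy()
pop_copy.iloc[0] = 99.0

print(df["pop"].tolist())   # original column is unchanged
print(pop_copy.tolist())    # only the copy was modified
```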

If a nested dictionary of dictionaries is passed to the DataFrame constructor, pandas interprets the outer dictionary keys as the columns and the inner keys as the row indices:

In [55]:
populations = {"Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6}, "Nevada": {2001: 2.4, 2002: 2.9}}
frame3 = pd.DataFrame(populations)
frame3

      Ohio  Nevada
2000   1.5     NaN
2001   1.7     2.4
2002   3.6     2.9


In [56]:
frame3.T

        2000  2001  2002
Ohio     1.5   1.7   3.6
Nevada   NaN   2.4   2.9


Warning<br>
Note that transposing discards the column data types if the columns do not all have the same data type, so transposing and then transposing back may lose the previous type information. The columns become arrays of pure Python objects in this case.
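
The loss of type information is easy to observe on a frame with mixed column types. This sketch uses a hypothetical frame `df`; `infer_objects` is shown as one possible way to recover types afterward, not as the only remedy:

```python
import pandas as pd

# Hypothetical mixed-type frame: one integer column, one string column
df = pd.DataFrame({"n": [1, 2], "label": ["x", "y"]})
roundtrip = df.T.T

print(df.dtypes)         # "n" has an integer dtype
print(roundtrip.dtypes)  # after the round trip, "n" is object

# infer_objects() attempts to restore better dtypes for object columns
restored = roundtrip.infer_objects()
print(restored.dtypes)
```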

In [57]:
pdata = {"Ohio": frame3["Ohio"][:-1], "Nevada": frame3["Nevada"][:2]}
pd.DataFrame(pdata)

      Ohio  Nevada
2000   1.5     NaN
2001   1.7     2.4


<h2>Possible Data Inputs to the DataFrame Constructor</h2>

<table>
    <tr>
        <th>Type</th>
        <th>Notes</th>
    </tr>
    <tr>
        <td>2D ndarray</td>
        <td>A matrix of data, passing optional row and column labels</td>
    </tr>
    <tr>
        <td>Dictionary of arrays, lists, or tuples</td>
        <td>Each sequence becomes a column in the DataFrame; all sequences must be the same length</td>
    </tr>
    <tr>
        <td>NumPy structured/record array</td>
        <td>Treated as the “dictionary of arrays” case</td>
    </tr>
    <tr>
        <td>Dictionary of Series</td>
        <td>Each value becomes a column; indexes from each Series are unioned together to form the result’s row index if no explicit index is passed</td>
    </tr>
    <tr>
        <td>Dictionary of dictionaries</td>
        <td>Each inner dictionary becomes a column; keys are unioned to form the row index as in the “dictionary of Series” case</td>
    </tr>
    <tr>
        <td>List of dictionaries or Series</td>
        <td>Each item becomes a row in the DataFrame; unions of dictionary keys or Series indexes become the DataFrame’s column labels</td>
    </tr>
    <tr>
        <td>List of lists or tuples</td>
        <td>Treated as the “2D ndarray” case</td>
    </tr>
    <tr>
        <td>Another DataFrame</td>
        <td>The DataFrame’s indexes are used unless different ones are passed</td>
    </tr>
    <tr>
        <td>NumPy MaskedArray</td>
        <td>Like the “2D ndarray” case except masked values are missing in the DataFrame result</td>
    </tr>
</table>
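
A few rows of the table can be sketched in code. The data and the names `from_array`, `from_records`, and `from_series` are made up for illustration:

```python
import numpy as np
import pandas as pd

# 2D ndarray with optional row and column labels
from_array = pd.DataFrame(np.arange(6).reshape(2, 3),
                          index=["r1", "r2"], columns=["a", "b", "c"])

# List of dictionaries: each item becomes a row; the union of keys forms the columns
from_records = pd.DataFrame([{"a": 1, "b": 2}, {"b": 3, "c": 4}])

# Dictionary of Series: the indexes are unioned to form the row index
from_series = pd.DataFrame({"x": pd.Series([1, 2], index=[0, 1]),
                            "y": pd.Series([3, 4], index=[1, 2])})

print(from_array, from_records, from_series, sep="\n\n")
```

In the last two cases, positions with no corresponding input value are filled with NaN.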

In [58]:
frame3

      Ohio  Nevada
2000   1.5     NaN
2001   1.7     2.4
2002   3.6     2.9


In [59]:
frame3.index.name = "year"
frame3

      Ohio  Nevada
year
2000   1.5     NaN
2001   1.7     2.4
2002   3.6     2.9


In [60]:
frame3.to_numpy()

array([[1.5, nan],
       [1.7, 2.4],
       [3.6, 2.9]])

In [61]:
frame2.to_numpy()

array([['Ohio', 2000, 1.5, nan],
       ['Ohio', 2001, 1.7, nan],
       ['Ohio', 2002, 3.6, -1.2],
       ['Nevada', 2001, 2.4, nan],
       ['Nevada', 2002, 2.9, -1.5],
       ['Nevada', 2003, 3.2, -1.7]], dtype=object)

## 1.3 Index Objects

pandas’s Index objects are responsible for holding the axis labels (including a DataFrame's column names) and other metadata (like the axis name or names). Any array or other sequence of labels you use when constructing a Series or DataFrame is internally converted to an Index:

In [64]:
obj = pd.Series(np.arange(3), index=["a", "b", "c"])
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [65]:
index[1:]

Index(['b', 'c'], dtype='object')

Index objects are immutable and thus can’t be modified by the user:

In [70]:
try:
    index[1] = "d"
except TypeError as err:
    print(err)

Index does not support mutable operations


In [72]:
labels = pd.Index(np.arange(3))
labels

Int64Index([0, 1, 2], dtype='int64')

In [73]:
obj.index = [0, 1, 2]

In [78]:
obj.index is labels

False

In [79]:
obj.index = labels

In [80]:
obj.index is labels

True

Unlike Python sets, a pandas Index can contain duplicate labels:

In [81]:
pd.Index(["foo", "foo", "bar", "bar"])

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
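
One consequence of duplicate labels is that selecting by a repeated label returns every matching entry. A minimal sketch, with a hypothetical Series `dup`:

```python
import pandas as pd

dup = pd.Series([1, 2, 3, 4], index=["foo", "foo", "bar", "bar"])

print(dup.index.is_unique)  # False, because labels repeat
print(dup["foo"])           # a Series holding both "foo" entries
print(dup.index.unique())   # the distinct labels, in order of first appearance
```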

<h2>Some Index Methods and Properties</h2>

<table>
    <tr>
        <th><code>Method/Property</code></th>
        <th>Description</th>
    </tr>
    <tr>
        <td><code>append()</code></td>
        <td>Concatenate with additional Index objects, producing a new Index</td>
    </tr>
    <tr>
        <td><code>difference()</code></td>
        <td>Compute set difference as an Index</td>
    </tr>
    <tr>
        <td><code>intersection()</code></td>
        <td>Compute set intersection</td>
    </tr>
    <tr>
        <td><code>union()</code></td>
        <td>Compute set union</td>
    </tr>
    <tr>
        <td><code>isin()</code></td>
        <td>Compute Boolean array indicating whether each value is contained in the passed collection</td>
    </tr>
    <tr>
        <td><code>delete()</code></td>
        <td>Compute new Index with the element at position i deleted</td>
    </tr>
    <tr>
        <td><code>drop()</code></td>
        <td>Compute new Index by deleting passed values</td>
    </tr>
    <tr>
        <td><code>insert()</code></td>
        <td>Compute new Index by inserting an element at position i</td>
    </tr>
    <tr>
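        <td><code>is_monotonic_increasing</code></td>
        <td>Returns True if each element is greater than or equal to the previous element (older pandas versions also expose this as <code>is_monotonic</code>, which was removed in pandas 2.0)</td>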
    </tr>
    <tr>
        <td><code>is_unique</code></td>
        <td>Returns True if the Index has no duplicate values</td>
    </tr>
    <tr>
        <td><code>unique()</code></td>
        <td>Compute the array of unique values in the Index</td>
    </tr>
</table>
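
Several of the methods in the table can be tried on two small Index objects. The names `left` and `right` are illustrative; every call returns a new object and leaves `left` unchanged, consistent with Index immutability:

```python
import pandas as pd

left = pd.Index(["a", "b", "c"])
right = pd.Index(["b", "c", "d"])

print(left.union(right))         # labels in either Index
print(left.intersection(right))  # labels in both
print(left.difference(right))    # labels only in left
print(left.isin(["a", "c"]))     # Boolean array, one entry per label
print(left.delete(0))            # new Index without the element at position 0
print(left.drop(["b"]))          # new Index without the label "b"
print(left.insert(1, "z"))       # new Index with "z" inserted at position 1
print(left.is_unique)            # True: no repeated labels
```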