# Pandas 🐼

Pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool,
built on top of the Python programming language.

# Introducing Pandas Objects

At the most basic level, Pandas objects can be thought of as **enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices.**
As we will see during the course of this chapter, Pandas provides a host of useful tools, methods, and functionality on top of the basic data structures, but **nearly everything that follows will require an understanding of what these structures are**.
Thus, before we go any further, let's introduce these three fundamental Pandas data structures: the ``Series``, ``DataFrame``, and ``Index``.

We will start our code sessions with the standard NumPy and Pandas imports:

In [3]:
import numpy as np
import pandas as pd

## The Pandas Series Object

A Pandas ``Series`` is a one-dimensional array of indexed data.
It can be created from a list or array as follows:

In [33]:
# Heights of the students in a class
data = pd.Series([1.5, 1.6, 1.75, 1.80])
data

0    1.50
1    1.60
2    1.75
3    1.80
dtype: float64

As we see in the output, the ``Series`` wraps both a sequence of values and a sequence of indices, which we can access with the ``values`` and ``index`` attributes.
The ``values`` are simply a familiar NumPy array:

In [34]:
data.values

array([1.5 , 1.6 , 1.75, 1.8 ])

The ``index`` is an array-like object of type ``pd.Index``, which we'll discuss in more detail momentarily.

In [35]:
type(data)

pandas.core.series.Series

In [36]:
data.index

RangeIndex(start=0, stop=4, step=1)

As with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:

In [38]:
type(data.values)

numpy.ndarray

In [39]:
data[0] # by default, returns the value stored under index label 0

1.5

In [40]:
# Even a one-element slice such as data[1:2] returns a Series.
data[0:1]

0    1.5
dtype: float64

In [41]:
data[:-1] # start, stop, step

0    1.50
1    1.60
2    1.75
dtype: float64

As we will see, though, **the Pandas ``Series`` is much more general and flexible than the one-dimensional NumPy** array that it emulates.

### ``Series`` as generalized NumPy array

From what we've seen so far, it may look like the ``Series`` object is basically interchangeable with a one-dimensional NumPy array.
**The essential difference is the presence of the index**: while the NumPy array has an *implicitly defined* integer index used to access the values, the Pandas ``Series`` has an ***explicitly defined*** index associated with the values.

This explicit index definition gives the ``Series`` object additional capabilities. The index need not be an integer; it can consist of values of any desired type.
For example, **if we wish, we can use strings as an index:**

In [43]:
data_nombre = pd.Series([1.5, 1.6, 1.75, 1.80],
                 index=['Jane', 'Joe', 'Susan', 'Mike'])
data_nombre

Jane     1.50
Joe      1.60
Susan    1.75
Mike     1.80
dtype: float64

In [44]:
data_nombre.values

array([1.5 , 1.6 , 1.75, 1.8 ])

In [45]:
data_nombre.index

Index(['Jane', 'Joe', 'Susan', 'Mike'], dtype='object')

And the item access works as expected:

In [46]:
print(data_nombre['Susan']) # Indexing by label
print(data_nombre[2]) # With a string index, an integer falls back to position (deprecated in recent pandas versions; .iloc[2] is the explicit form)
print(data_nombre[0:2]) # Positional slicing: the end is NOT included
# Slicing with integers is always positional, with the end excluded.

1.75
1.75
Jane    1.5
Joe     1.6
dtype: float64


In [58]:
data_nombre.index[2]

'Susan'

In [54]:
data_nombre[1]
data_nombre["Mike"]

1.8

In [62]:
data_nombre

Jane     1.50
Joe      1.60
Susan    1.75
Mike     1.80
dtype: float64

In [61]:
data_nombre[1:3]

Joe      1.60
Susan    1.75
dtype: float64

In [27]:
print(data_nombre['Jane':'Susan']) # With a string index, slicing by label.
# The end label IS included

Jane     1.50
Joe      1.60
Susan    1.75
dtype: float64


We can even use non-contiguous or non-sequential indices:

In [64]:
data = pd.Series([1.5, 1.6, 1.75, 1.80],
                 index=[2, 5, 3, 7])
data

2    1.50
5    1.60
3    1.75
7    1.80
dtype: float64

In [29]:
print(data[5]) # Indexing by label, not by position
print(data[0:2]) # Ordinary slicing by POSITION

1.6
2    1.5
5    1.6
dtype: float64


In [68]:
data.values[-1]

1.8

In [None]:
# A slice always returns an object of the same kind as the one being sliced:
# slicing a Series returns a Series (array of values + index)
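
Because plain square brackets mix label-based and positional behavior, it can help to see the two spelled out explicitly. The following is a minimal sketch, assuming the ``.loc`` (label) and ``.iloc`` (position) accessors that Pandas provides; the variable name ``heights`` is only illustrative and stands in for the integer-labeled ``Series`` built above.

```python
import pandas as pd

# Same values as above, with integer labels that are not 0..n-1
heights = pd.Series([1.5, 1.6, 1.75, 1.80], index=[2, 5, 3, 7])

by_label = heights.loc[5]           # label 5
by_position = heights.iloc[0]       # first element, whatever its label

label_slice = heights.loc[2:3]      # from label 2 to label 3, end label included
position_slice = heights.iloc[0:2]  # first two elements, end position excluded

print(by_label, by_position)
print(label_slice)
print(position_slice)
```

Using the accessors avoids having to remember which rule plain brackets apply for a given index type.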

### Series as specialized dictionary

In this way, you can think of a **Pandas ``Series`` a bit like a specialization of a Python dictionary.**
A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a ``Series`` is a structure which maps typed keys to a set of typed values.
This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas ``Series`` makes it much more efficient than Python dictionaries for certain operations.

The ``Series``-as-dictionary analogy can be made even clearer by constructing a ``Series`` object directly from a Python dictionary:

In [69]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [70]:
population.values

array([38332521, 26448193, 19651127, 19552860, 12882135], dtype=int64)

In [71]:
population.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

By default, a ``Series`` will be created where the index is drawn from the dictionary keys, in the order in which they were inserted (as the output above shows, the keys are not sorted).
From here, typical dictionary-style item access can be performed:

In [73]:
population['California'] # Indexing by key

38332521

In [77]:
population[0]

38332521

Unlike a dictionary, though, the ``Series`` also supports array-style operations such as slicing:

In [78]:
population['California':'Florida'] # Slicing by key; the end key is included

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
dtype: int64

We'll discuss some of the quirks of Pandas indexing and slicing in [Data Indexing and Selection](03.02-Data-Indexing-and-Selection.ipynb).
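
To make the dictionary analogy concrete, here is a small sketch of dictionary-style operations on a ``Series``, followed by an array-style operation that a plain dictionary does not offer. It rebuilds a shortened ``population`` so that it runs on its own; the names ``has_texas``, ``keys``, ``pairs``, and ``in_millions`` are illustrative only.

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127})

has_texas = 'Texas' in population   # membership is tested against the index
keys = list(population.keys())      # the same labels as population.index
pairs = list(population.items())    # (label, value) tuples

# Dictionary-style assignment adds a new label
population['Florida'] = 19552860

# Unlike a dict, arithmetic applies to every value at once
in_millions = population / 1e6
print(in_millions)
```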

### Constructing Series objects

We've already seen a few ways of constructing a Pandas ``Series`` from scratch; all of them are some version of the following:

```python
>>> pd.Series(data, index=index)
```

where ``index`` is an optional argument, and ``data`` can be one of many entities.

For example, ``data`` can be a list or NumPy array, in which case ``index`` defaults to an integer sequence:

In [23]:
# If nothing is specified, a RangeIndex is added
# Index labels do not have to be unique; repeats are allowed.
pd.Series([2, 4, 6])

0    2
1    4
2    6
dtype: int64

``data`` can be a scalar, which is repeated to fill the specified index:

In [24]:
pd.Series(5, index=[100, 200, 300])

100    5
200    5
300    5
dtype: int64

``data`` can be a dictionary, in which case ``index`` defaults to the dictionary keys in insertion order:

In [96]:
otra = pd.Series({"Illinois":'a', 1:'b', 3:365})
otra

Illinois      a
1             b
3           365
dtype: object

In each case, the index can be explicitly set if a different result is preferred:

In [101]:
# Filter (and reorder) the index labels at construction time
pd.Series({2:'a', 1:'b', 3:'c'}, index=[1, 2, 3])

1    b
2    a
3    c
dtype: object

Notice that in this case, the ``Series`` is populated only with the explicitly identified keys.
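
As a rough illustration of that behavior, the sketch below passes an ``index`` that both drops keys and asks for one that is not in the dictionary. The names ``letters``, ``subset``, and ``with_missing`` are hypothetical.

```python
import pandas as pd

letters = {2: 'a', 1: 'b', 3: 'c'}

# Only keys 3 and 2 are kept, in the order given by index
subset = pd.Series(letters, index=[3, 2])

# Key 4 is not in the dictionary, so its entry is filled with NaN
with_missing = pd.Series(letters, index=[1, 4])

print(subset)
print(with_missing)
```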

## The Pandas DataFrame Object

The next fundamental structure in Pandas is the ``DataFrame``.
Like the ``Series`` object discussed in the previous section, the ``DataFrame`` can be **thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.**
We'll now take a look at each of these perspectives.

### DataFrame as a generalized NumPy array
If a ``Series`` is an analog of a one-dimensional array with flexible indices, a **``DataFrame`` is an analog of a two-dimensional array with both flexible row indices and flexible column names.**
Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a ``DataFrame`` as a sequence of aligned ``Series`` objects.
Here, by "aligned" we mean that **they share the same index.**

To demonstrate this, let's first construct a new ``Series`` listing the area of each of the five states discussed in the previous section:

In [126]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [103]:
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

Now that we have this along with the ``population`` Series from before, we can use a dictionary to construct a single two-dimensional object containing this information:

In [128]:
otra_population = [12,34,56,78,90]

In [129]:
states = pd.DataFrame({'population2': population,
                       'area2': area,
                      'otra': otra_population})

# The dictionary values do not have to be Series;
# they can be arrays, lists, or Series
states

            population2   area2  otra
California     38332521  423967    12
Texas          26448193  695662    34
New York       19651127  141297    56
Florida        19552860  170312    78
Illinois       12882135  149995    90


In [118]:
states.otra.values

array([12, 34, 56, 78, 90], dtype=int64)

Like the ``Series`` object, the ``DataFrame`` has an ``index`` attribute that gives access to the index labels:

In [38]:
states.index # the index is shared by all columns, because it belongs to the DataFrame

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

Additionally, the ``DataFrame`` has a ``columns`` attribute, which is an ``Index`` object holding the column labels:

In [121]:
states.columns

Index(['population2', 'area2', 'otra'], dtype='object')

**Thus the ``DataFrame`` can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data.**
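
The idea of "aligned" columns can be seen most clearly when the input ``Series`` do not share exactly the same labels. The following sketch uses deliberately mismatched, shortened versions of ``area`` and ``population``; it is an illustration under those assumptions, not part of the session above.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662, 'Florida': 170312})
population = pd.Series({'Texas': 26448193, 'California': 38332521,
                        'New York': 19651127})

# Values are matched by label, not by position; labels missing from
# one of the Series are filled with NaN in that column
states = pd.DataFrame({'population': population, 'area': area})
print(states)
```

Note that the order of the dictionary keys in each ``Series`` does not matter: Texas's population ends up next to Texas's area.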

### DataFrame as specialized dictionary

Similarly, we can also think of a ``DataFrame`` as a specialization of a dictionary.
Where a dictionary maps a key to a value, a ``DataFrame`` maps a column name to a ``Series`` of column data.
For example, asking for the ``'area2'`` column returns the ``Series`` object containing the areas we saw earlier:

In [130]:
states.columns

Index(['population2', 'area2', 'otra'], dtype='object')

In [133]:
# Attribute-style access returns that column of the DataFrame as a Series
states.area2

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area2, dtype: int64

In [136]:
# Dictionary-style access indexes the DataFrame by column name
states["area2"]

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area2, dtype: int64

In [137]:
states[["area2", "population2"]]

             area2  population2
California  423967     38332521
Texas       695662     26448193
New York    141297     19651127
Florida     170312     19552860
Illinois    149995     12882135


In [139]:
states.columns

Index(['population2', 'area2', 'otra'], dtype='object')

In [143]:
states.columns[0:2]

Index(['population2', 'area2'], dtype='object')

In [142]:
# This works because the slice of states.columns holds column labels;
# states[0] would raise a KeyError, since no column is labeled 0
states[states.columns[0:2]]

            population2   area2
California     38332521  423967
Texas          26448193  695662
New York       19651127  141297
Florida        19552860  170312
Illinois       12882135  149995


**Notice the potential point of confusion here: in a two-dimensional NumPy array, ``data[0]`` will return the first *row*. For a ``DataFrame``, ``data['col0']`` will return the first *column*.**
Because of this, it is probably better to think about ``DataFrame``s as generalized dictionaries rather than generalized arrays, though both ways of looking at the situation can be useful.
We'll explore more flexible means of indexing ``DataFrame``s in [Data Indexing and Selection](03.02-Data-Indexing-and-Selection.ipynb).
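
As a brief preview, the row-versus-column distinction can be sketched as follows. The two-row ``states`` frame is a cut-down stand-in for the one above, and ``.iloc`` and ``.loc`` are the Pandas accessors for positional and label-based row selection.

```python
import pandas as pd

states = pd.DataFrame({'population': [38332521, 26448193],
                       'area': [423967, 695662]},
                      index=['California', 'Texas'])

first_column = states['population']  # dictionary-style access: a column
first_row_np = states.values[0]      # NumPy-style access on the raw array: a row
first_row = states.iloc[0]           # positional row access, labels preserved
texas_row = states.loc['Texas']      # label-based row access

print(first_column)
print(first_row_np)
print(first_row)
print(texas_row)
```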

### Constructing DataFrame objects

A Pandas ``DataFrame`` can be constructed in a variety of ways.
Here we'll give several examples.

In [148]:
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

#### From a single Series object

A ``DataFrame`` is a collection of ``Series`` objects, and a single-column ``DataFrame`` can be constructed from a single ``Series``:

In [151]:
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [150]:
pd.DataFrame(population, columns=['population'])

            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135


#### From a list of dicts

Any list of dictionaries can be made into a ``DataFrame``.
We'll use a simple list comprehension to create some data:

In [34]:
data = [{'a': i, 'b': 2 * i} for i in range(3)]
#data = {'a': [0,1,2], 'b': [0,2,4]}
pd.DataFrame(data)

   a  b
0  0  0
1  1  2
2  2  4


In [158]:
# data is now a plain list, which has no shape; inspect the DataFrame instead
pd.DataFrame(data).shape

(3, 2)

Even if some keys in the dictionary are missing, Pandas will fill them in with ``NaN`` (i.e., "not a number") values:

In [35]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0
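
One side effect worth noticing in this output is that columns ``a`` and ``c`` are shown as floats. A short sketch of why, and of two common ways to inspect or replace the gaps, follows; the names ``records``, ``missing``, and ``filled`` are illustrative.

```python
import pandas as pd

records = [{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]
df = pd.DataFrame(records)

# NaN is a floating-point value, so integer columns that contain
# gaps are stored as float64; the complete column 'b' stays int64
print(df.dtypes)

missing = df.isna()    # boolean mask marking the filled-in gaps
filled = df.fillna(0)  # one possible way to replace the gaps
print(missing)
print(filled)
```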


#### From a dictionary of Series objects

As we saw before, a ``DataFrame`` can be constructed from a dictionary of ``Series`` objects as well:

In [152]:
pd.DataFrame({'population': population,
              'area': area})

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


#### From a two-dimensional NumPy array

Given a two-dimensional array of data, we can create a ``DataFrame`` with any specified column and index names.
If omitted, an integer index will be used for each:

In [153]:
np.random.rand(3, 2)

array([[0.37301692, 0.63668363],
       [0.51549872, 0.58149783],
       [0.93687346, 0.97664135]])

In [156]:
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

        foo       bar
a  0.433150  0.214932
b  0.413857  0.778165
c  0.696234  0.151250
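
The example above names both axes. For comparison, the sketch below also shows the default integer labels that appear when ``columns`` and ``index`` are omitted. It uses a seeded NumPy generator (``np.random.default_rng``) so that both frames are built from the same array; the seed and variable names are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the run is repeatable
values = rng.random((3, 2))

# No labels given: rows are labeled 0..2 and columns 0..1
default_labels = pd.DataFrame(values)

# Same data with explicit row and column labels
named = pd.DataFrame(values, columns=['foo', 'bar'], index=['a', 'b', 'c'])

print(default_labels)
print(named)
```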


## The Pandas Index Object

We have seen here that both the ``Series`` and ``DataFrame`` objects contain an explicit *index* that lets you reference and modify data.
This ``Index`` object is an interesting structure in itself, and **it can be thought of either as an *immutable array* or as an *ordered set* (technically a multi-set, as ``Index`` objects may contain repeated values).**
Those views have some interesting consequences in the operations available on ``Index`` objects.
As a simple example, let's construct an ``Index`` from a list of integers:

In [157]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

### Index as immutable array

The ``Index`` in many ways operates like an array.
For example, we can use standard Python indexing notation to retrieve values or slices:

In [42]:
ind[1]

3

In [43]:
ind[::2]

Int64Index([2, 5, 11], dtype='int64')

``Index`` objects also have many of the attributes familiar from NumPy arrays:

In [44]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)

5 (5,) 1 int64


One difference between ``Index`` objects and NumPy arrays is that indices are immutable: they cannot be modified via the normal means.

In [46]:
ind[1] = 0

TypeError: Index does not support mutable operations
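
When a changed index is actually needed, the usual route is to build a new ``Index`` rather than modify the existing one. A minimal sketch, assuming the ``Index.delete`` and ``Index.insert`` methods, both of which return new objects (the names ``mutated`` and ``replaced`` are illustrative):

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

# In-place assignment is rejected
try:
    ind[1] = 0
    mutated = True
except TypeError:
    mutated = False

# delete and insert each return a new Index; ind itself is untouched
replaced = ind.delete(1).insert(1, 0)

print(mutated)
print(list(replaced))
print(list(ind))
```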

**This immutability makes it safer to share indices between multiple ``DataFrame``s and arrays, without the potential for side effects from inadvertent index modification.**
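
The ordered-set view mentioned at the start of this section can be sketched with the set-style methods that ``Index`` provides. The two indices below are illustrative values chosen to overlap partially.

```python
import pandas as pd

ind_a = pd.Index([1, 3, 5, 7, 9])
ind_b = pd.Index([2, 3, 5, 7, 11])

common = ind_a.intersection(ind_b)            # labels present in both
either = ind_a.union(ind_b)                   # labels present in at least one
only_one = ind_a.symmetric_difference(ind_b)  # labels present in exactly one

print(list(common))
print(list(either))
print(list(only_one))
```

These operations are what allow Pandas to line up data from differently labeled objects, for example when combining two ``Series`` into one ``DataFrame``.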