# Pandas fundamentals

**Credits**: Based on the [_Python Data Science Handbook_ by Jake VanderPlas](https://jakevdp.github.io/PythonDataScienceHandbook/)

## Series object

A Pandas **Series** is a one-dimensional array of indexed data. It can be created from a list or array as follows:

In [None]:
import numpy as np
import pandas as pd

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

*Series* wraps both a *sequence of values* and a *sequence of indices*. The values are simply a familiar NumPy array:

In [None]:
data.values

The index is an array-like object of type `pd.Index`:

In [None]:
data.index

As with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation.
As we will see, though, the Pandas Series is much more general and flexible than the one-dimensional NumPy array that it emulates.

In [None]:
data[1]

In [None]:
data[1:3]

### Series as a generalized NumPy array


In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['mau', 'medio', 'bom', 'cool'])
data

In [None]:
data['medio']

In [None]:
# Does slicing by label work?
data['medio':'cool']
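
One detail worth making explicit about label-based slicing: the final label is included in the result, whereas a positional slice excludes its stop position. A minimal sketch, reusing the same labels (the names `s`, `by_label`, and `by_position` are introduced here only for illustration):

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0], index=['mau', 'medio', 'bom', 'cool'])

# Label-based slice: both endpoints are included
by_label = s['medio':'cool']

# Positional slice: the stop position is excluded
by_position = s.iloc[1:3]

print(list(by_label.index))
print(list(by_position.index))
```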

### Series as specialized dictionary

In [None]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

In [None]:
population['California']
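
Unlike a plain dictionary, a Series built this way also supports array-style operations such as slicing by label. A short sketch using the same data (`pop_series`, `texas`, and `subset` are names chosen for this example):

```python
import pandas as pd

pop_series = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

# Dictionary-style lookup by key
texas = pop_series['Texas']

# Array-style slice by label (both endpoints are included)
subset = pop_series['California':'New York']
print(subset)
```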

## The Pandas DataFrame Object

In [None]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

In [None]:
states = pd.DataFrame({'population': population, 'area': area})
states

Like the Series object, the DataFrame has an index attribute that gives access to the index labels:

In [None]:
states.index

Additionally, the DataFrame has a columns attribute, which is an Index object holding the column labels:

In [None]:
states.columns

We can also think of a DataFrame as a specialization of a dictionary. Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data.

In [None]:
states['area']

### Constructing DataFrame objects
A Pandas `DataFrame` can be constructed in a variety of ways. Here we'll give several examples.

From a single Series object:

In [None]:
pd.DataFrame(population, columns=['population'])

From a dictionary of Series objects:

In [None]:
pd.DataFrame({'population': population, 'area': area})

From a list of dictionaries. Even if some keys are missing from some of the dictionaries, Pandas will fill the gaps with NaN (i.e., "not a number") values:

In [None]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
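
To see where the NaN values end up, the result can be inspected with `isna()` and `dtypes`. A small sketch of that check (the variable name `df` is an assumption of this example):

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# The columns are the union of all keys; missing entries become NaN
print(df.isna())

# Columns that contain NaN are stored as floats, while 'b' stays integer
print(df.dtypes)
```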

From a two-dimensional NumPy array:

In [None]:
data = pd.DataFrame(np.random.rand(3, 2), columns=['foo', 'bar'], index=['a', 'b', 'c'])
data

As mentioned previously, we can also view the DataFrame as an enhanced two-dimensional array. We can examine the raw underlying data array using the values attribute:

In [None]:
data.values

With this picture in mind, many familiar array-like operations can be performed on the DataFrame itself. For example, we can transpose the full DataFrame to swap rows and columns:

In [None]:
data.T
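
The effect of `values` and `T` is easier to follow with fixed numbers than with random ones. The sketch below assumes a small array built with `np.arange`, which is not part of the original example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2),
                  columns=['foo', 'bar'], index=['a', 'b', 'c'])

# The underlying data as a plain two-dimensional NumPy array
raw = df.values
print(raw.shape)

# Transposing swaps the row labels and the column labels
transposed = df.T
print(transposed)
```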

## The Pandas Index Object

The Index object is an interesting structure in itself, and it can be thought of either as an immutable array or as an ordered set:

In [None]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

In [None]:
ind[1]

In [None]:
ind[::2]

One difference between Index objects and NumPy arrays is that indices are immutable: they cannot be modified via the normal means.

In [None]:
# Can you guess the result? (Hint: an Index is immutable, so this raises a TypeError)
ind[1] = 0
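
A minimal sketch of what happens on assignment, and of one way to get a modified index anyway: build a new `Index` from an edited copy of the values. The names `raised`, `values`, and `new_ind` are introduced for this example only.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

raised = False
try:
    ind[1] = 0  # item assignment is not supported on an Index
except TypeError as err:
    raised = True
    print("TypeError:", err)

# To obtain a modified index, build a new Index from the edited values
values = ind.tolist()
values[1] = 0
new_ind = pd.Index(values)
print(new_ind)
```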

Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic. The Index object follows many of the conventions used by Python's built-in set data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way:

In [None]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [None]:
indA.intersection(indB)  # intersection (the & operator is element-wise in pandas >= 2.0)

In [None]:
indA.union(indB)  # union (the | operator is element-wise in pandas >= 2.0)

In [None]:
indA.symmetric_difference(indB)  # symmetric difference (the ^ operator is element-wise in pandas >= 2.0)

# Data Indexing and Selection

### Series

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['mau', 'medio', 'bom', 'cool'])
data

In [None]:
'medio' in data

In [None]:
data.keys()

In [None]:
list(data.items())

In [None]:
data["bom"] = 0.76

In [None]:
# slicing by explicit index
data['mau':'bom']

In [None]:
# slicing by implicit integer index
data[0:2]

In [None]:
# masking
data[(data > 0.3) & (data < 0.8)]

In [None]:
# fancy indexing
data[['bom', 'mau']]

### Indexers: loc and iloc

Slicing and indexing conventions can be a source of confusion. For example, if your Series has an explicit integer index, an indexing operation such as `data[1]` will use the explicit indices, while a slicing operation like `data[1:3]` will use the implicit Python-style index.


In [None]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

In [None]:
# explicit index when indexing
data[1]

In [None]:
# implicit index when slicing
data[1:3]

Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer attributes that explicitly expose certain indexing schemes. First, the loc attribute allows indexing and slicing that always references the explicit index:

In [None]:
data.loc[1]

In [None]:
data.loc[1:3]

The iloc attribute allows indexing and slicing that always references the implicit Python-style index:

In [None]:
data.iloc[1]

In [None]:
data.iloc[1:3]
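
The four cells above can be condensed into one side-by-side comparison. This is only a sketch with the same toy Series, here named `s` for illustration:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc uses the explicit labels 1, 3, 5; the slice includes label 3
print(s.loc[1])
print(list(s.loc[1:3]))

# iloc uses the positions 0, 1, 2; the slice excludes position 3
print(s.iloc[1])
print(list(s.iloc[1:3]))
```

Using `loc` or `iloc` explicitly, rather than bare square brackets, keeps code with integer indexes unambiguous.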

### Data Selection in DataFrame

In [None]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

In [None]:
data['area']

In [None]:
data['density'] = data['pop'] / data['area']
data

In [None]:
data['density']["California":"New York"]

### Important indexing conventions

First, while indexing refers to columns, slicing refers to rows:

In [None]:
data['density']

In [None]:
data['California':'New York']

Such slices can also refer to rows by number rather than by index:

In [None]:
data[0:3]

Similarly, **direct masking operations are also interpreted row-wise** rather than column-wise:

In [None]:
data[data.density > 100]

Using the `loc` indexer, we can index the underlying data in an array-like style but using the explicit index and column names:


In [None]:
data.loc[data["density"] > 100]

In [None]:
data.loc[data["density"] > 100, ['area','pop']]

In [None]:
data.loc[:'New York', :'pop']

In [None]:
data.iloc[:3, :2]
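
The last two cells select the same block of the table, once by label and once by position. A short sketch that checks this, rebuilding the table under the hypothetical name `df`:

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
df = pd.DataFrame({'area': area, 'pop': pop})
df['density'] = df['pop'] / df['area']

# Label-based: rows up to and including 'New York', columns up to and including 'pop'
by_label = df.loc[:'New York', :'pop']

# Position-based: the first three rows and the first two columns
by_position = df.iloc[:3, :2]
same = by_label.equals(by_position)
print(same)

# A boolean mask on rows combined with a list of column labels
dense = df.loc[df['density'] > 100, ['area', 'pop']]
print(dense)
```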

# Reading and writing data 
Let's create a DataFrame from scratch...

In [None]:
import pandas as pd
data = {
    'apples': [3, 2, 0, 1], 
    'oranges': [0, 3, 7, 2]
}
purchases = pd.DataFrame(data)
purchases

The **Index** of this DataFrame was given to us on creation as the numbers 0-3, but we could also create our own when we initialize the DataFrame. 

Let's have customer names as our index: 

In [None]:
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])
purchases

In [None]:
purchases["apples"]

We can **loc**ate a customer's order by using their name:

In [None]:
purchases.loc["Robert"]

### Writing to CSV, JSON and SQL files

It’s quite simple to save a DataFrame to various file formats and to load it back again.

In [None]:
purchases.to_csv('dados/purchases.csv')

In [None]:
purchases.to_json('dados/purchases.json')

If you’re working with data from a SQL database you need to first establish a connection using an appropriate Python library, then pass a query to pandas. Here we'll use SQLite to demonstrate. 

In [None]:
import sqlite3
con = sqlite3.connect("dados/purchases.sqlite3")
purchases.to_sql('purchases', con, if_exists='replace')  # overwrite the table if this cell is re-run

### Reading from CSV files

In [None]:
df = pd.read_csv('dados/purchases.csv')
df

CSVs don't have indexes like our DataFrames, so all we need to do is just designate the `index_col` when reading:

In [None]:
df = pd.read_csv('dados/purchases.csv', index_col=0)
df
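
The role of `index_col` can be seen without touching the file system by writing to an in-memory buffer. The sketch below uses `io.StringIO` in place of the `dados/purchases.csv` file (an assumption made to keep the example self-contained):

```python
import io
import pandas as pd

purchases = pd.DataFrame({'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]},
                         index=['June', 'Robert', 'Lily', 'David'])

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
purchases.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)

# Without index_col, the old index comes back as an ordinary, unnamed column
plain = pd.read_csv(io.StringIO(csv_text))
print(list(plain.columns))

# With index_col=0, the first column is restored as the index
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored)
```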

### Reading data from JSON

If you have a JSON file (which is essentially a stored Python `dict`), pandas can read it just as easily:

In [None]:
df = pd.read_json('dados/purchases.json')
df
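
A JSON round trip can be sketched the same way in memory. Here `io.StringIO` stands in for the `dados/purchases.json` file (an assumption of this example), and the default `orient` of `to_json` is used, which stores each column as a mapping from index label to value:

```python
import io
import pandas as pd

purchases = pd.DataFrame({'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]},
                         index=['June', 'Robert', 'Lily', 'David'])

# With no path argument, to_json returns the JSON document as a string
json_text = purchases.to_json()
print(json_text)

# The index labels are stored in the JSON itself, so no index_col is needed
restored = pd.read_json(io.StringIO(json_text))
print(restored)
```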

### Reading data from a SQL database

In [None]:
import sqlite3
con = sqlite3.connect("dados/purchases.sqlite3")
df = pd.read_sql_query("SELECT * FROM purchases", con) # index_col='index'
df

Just like with CSVs, we could pass `index_col='index'`, but we can also set an index after the fact:

In [None]:
df = df.set_index('index')
df
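
For comparison, here is a sketch of the `index_col='index'` variant mentioned above. It assumes an in-memory SQLite database (`":memory:"`) rather than the `dados/purchases.sqlite3` file, so that it can run anywhere:

```python
import sqlite3
import pandas as pd

purchases = pd.DataFrame({'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]},
                         index=['June', 'Robert', 'Lily', 'David'])

# An in-memory database that disappears when the connection is closed
con = sqlite3.connect(":memory:")

# The unnamed DataFrame index is written as a column called 'index'
purchases.to_sql('purchases', con, if_exists='replace')

# Restore that column as the index while reading
df = pd.read_sql_query("SELECT * FROM purchases", con, index_col='index')
con.close()
print(df)
```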

# Operating on Data in Pandas

Because Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas Series and DataFrame objects:

In [None]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'population': population})
data

In [None]:
np.sqrt(data["area"])

In [None]:
np.sqrt(data)

In [None]:
np.sum(data["area"])

In [None]:
population / area
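
A point the cells above illustrate implicitly is that a ufunc applied to a Series returns another Series with the same index labels. A minimal sketch on a shortened copy of the `area` data (the names `root` and `total` are illustrative):

```python
import numpy as np
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297})

# Element-wise ufunc: the result is a Series that keeps the original labels
root = np.sqrt(area)
print(root)

# A reduction collapses the Series to a single number
total = np.sum(area)
print(total)
```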

### Index alignment in Series

In [None]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')
population / area

In [None]:
area.index.union(population.index)  # the | operator no longer performs a set union in pandas >= 2.0
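
The alignment rule can be summarized as follows: the result is indexed by the union of the two indexes, and any label present in only one input produces NaN. A short sketch that checks this on the same data (`density` and `missing` are names assumed for the example):

```python
import pandas as pd

area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

# Values are matched by label, not by position
density = population / area
print(density)

# Labels that appear in only one of the two Series yield NaN
missing = sorted(density[density.isna()].index)
print(missing)
```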

# Handling Missing Data

In [None]:
# Next week ...

In [None]:
# In the meantime, let's see a practical example