# Pandas fundamentals

**Credits**: Based on the [_Python Data Science Handbook_ by Jake VanderPlas](https://tanthiamhuat.files.wordpress.com/2018/04/pythondatasciencehandbook.pdf)

## Series object

A Pandas **Series** is a one-dimensional array of indexed data. It can be created from a list or array as follows:

In [28]:
import numpy as np
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

*Series* wraps both a *sequence of values* and a *sequence of indices*. The values are simply a familiar NumPy array:

In [29]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

The index is an array-like object of type `pd.Index`:

In [30]:
data.index

RangeIndex(start=0, stop=4, step=1)

As with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation.
As we will see, though, the Pandas Series is much more general and flexible than the one-dimensional NumPy array that it emulates.

In [31]:
data[1]

0.5

In [32]:
data[1:3]

1    0.50
2    0.75
dtype: float64
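The accessors shown above can be collected in one short sketch. It additionally uses `.iloc`, which is not part of the original cells; it is included here on the assumption that explicit positional access is useful once the index stops being the default `RangeIndex`.

```python
import numpy as np
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])

# .values exposes the underlying NumPy array; .index holds the labels.
values_are_numpy = isinstance(data.values, np.ndarray)
index_labels = list(data.index)

# .iloc always selects by integer position, whatever the index labels are.
second = data.iloc[1]
middle = data.iloc[1:3]  # positions 1 and 2; the stop position is excluded

print(values_are_numpy, index_labels, second)
print(middle)
```

With the default integer index, `data[1]` and `data.iloc[1]` return the same element, because label and position coincide.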

### Series as a generalized NumPy array


In [41]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['mau', 'medio', 'bom', 'cool'])
data

mau      0.25
medio    0.50
bom      0.75
cool     1.00
dtype: float64

In [46]:
data['medio']

0.5

In [45]:
# Does this work?
data['medio':'cool']

medio    0.50
bom      0.75
cool     1.00
dtype: float64
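The output above shows that slicing by label includes the stop label, unlike slicing by position. A minimal sketch of the difference, using the explicit `.loc` and `.iloc` accessors (these two accessors are not used in the original cells and are added here only for illustration):

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['mau', 'medio', 'bom', 'cool'])

# Label-based slice: the stop label 'cool' is included.
by_label = data.loc['medio':'cool']

# Position-based slice: the stop position 3 is excluded.
by_position = data.iloc[1:3]

print(list(by_label.index))
print(list(by_position.index))
```

Spelling out `.loc` or `.iloc` makes it clear to the reader which of the two slicing rules applies.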

### Series as specialized dictionary

In [47]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population


California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [48]:
population['California']

38332521
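A Series built from a dictionary keeps some dictionary behavior and adds array-style operations on top. The sketch below assumes a recent pandas version, in which the index preserves the dictionary's insertion order.

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

# Membership is tested against the index labels, like dictionary keys.
has_texas = 'Texas' in population

# keys() returns the index.
keys = list(population.keys())

# Unlike a plain dictionary, a Series supports slicing by label.
subset = population.loc['California':'New York']

print(has_texas)
print(keys)
print(subset)
```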

# Reading and writing data from/to files

In [3]:
import pandas as pd
data = {
    'apples': [3, 2, 0, 1], 
    'oranges': [0, 3, 7, 2]
}
purchases = pd.DataFrame(data)
purchases

   apples  oranges
0       3        0
1       2        3
2       0        7
3       1        2


The **Index** of this DataFrame was given to us on creation as the numbers 0-3, but we could also create our own when we initialize the DataFrame. 

Let's have customer names as our index: 

In [4]:
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])
purchases

        apples  oranges
June         3        0
Robert       2        3
Lily         0        7
David        1        2


In [7]:
purchases["apples"]

June      3
Robert    2
Lily      0
David     1
Name: apples, dtype: int64

In [12]:
# We can locate a customer's order with .loc, using their name:
purchases.loc["Robert"]

apples     2
oranges    3
Name: Robert, dtype: int64
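The row lookup above can be compared with the other common ways of selecting from a DataFrame. This is a small sketch; `.iloc` and the two-argument form of `.loc` do not appear in the original cells and are shown here only as illustration.

```python
import pandas as pd

purchases = pd.DataFrame(
    {'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]},
    index=['June', 'Robert', 'Lily', 'David'],
)

# A row selected by its index label.
robert_by_label = purchases.loc['Robert']

# The same row selected by its integer position.
robert_by_position = purchases.iloc[1]

# A single cell: row label first, then column label.
lily_oranges = purchases.loc['Lily', 'oranges']

print(robert_by_label)
print(lily_oranges)
```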

## Writing to CSV, JSON and SQL files

It’s quite simple to save a DataFrame to various file formats and to load it back again.

In [18]:
purchases.to_csv('data/purchases.csv')
purchases.to_json('data/purchases.json')

In [19]:
import sqlite3
con = sqlite3.connect("data/purchases.sqlite3")
purchases.to_sql('purchases', con)

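ValueError: Table 'purchases' already exists.

By default, `to_sql` uses `if_exists='fail'`, so it raises this error when the table already exists, for example when the cell is run a second time. Passing `if_exists='replace'` or `if_exists='append'` changes that behavior.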
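To check that the saved data can be read back, here is a minimal round-trip sketch. It writes to a temporary directory rather than the `data/` folder used above, so that it runs anywhere; that choice, and the use of `read_csv` and `read_sql`, are assumptions made for this example. The `to_sql` call passes `if_exists='replace'` so that repeated calls overwrite the table.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd

purchases = pd.DataFrame(
    {'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]},
    index=['June', 'Robert', 'Lily', 'David'],
)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / 'purchases.csv'
    db_path = Path(tmp) / 'purchases.sqlite3'

    # CSV round trip: the first column holds the index written by to_csv.
    purchases.to_csv(csv_path)
    from_csv = pd.read_csv(csv_path, index_col=0)

    con = sqlite3.connect(db_path)
    try:
        # Calling to_sql twice is safe here because the table is replaced.
        purchases.to_sql('purchases', con, if_exists='replace')
        purchases.to_sql('purchases', con, if_exists='replace')
        # An unnamed index is stored in a column called 'index'.
        from_sql = pd.read_sql('SELECT * FROM purchases', con,
                               index_col='index')
    finally:
        con.close()

print(from_csv)
print(from_sql)
```

Use `if_exists='append'` instead when new rows should be added to an existing table rather than overwriting it.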