# Pandas Data Structures

## Introduction

Managing datasets using only the Python Standard Library can be time consuming, as it has no specific tools for the main operations needed in data wrangling and data munging. Some standard packages are very useful (e.g. `csv`, `pickle`, etc.), but they do not share a common syntax or API, so the user may end up combining several libraries to tackle a single problem.

### What is Pandas?

Pandas stands for "Python Data Analysis Library" and it incorporates most of the tools needed for working with datasets and tabular data. 

It has two main data structures, named `DataFrame` and `Series`, which represent two-dimensional datasets and one-dimensional vectors respectively. They can be seen as datasets and dataset columns (variables).

### What will you learn in this session?

* Which are the main data structures in the Pandas library
* How to create datasets using Pandas
* Accessing dataset values

## Installing Pandas

Pandas can be easily installed using `pip`:

```bash
$ pip install pandas
```

Pandas is commonly imported using the following Python alias naming:

In [2]:
import pandas as pd
import numpy as np

## Contents

* [Series](#Series)
    * [Series Creation](#Series-Creation)
    * [Series Name](#Series-Name)
    * [Series Values Accessing](#Series-Values-Accessing)
    * [Series Operations](#Series-Operations)
    * [Series `dtypes`](#Series-dtypes)
    * [Series Methods](#Series-Methods)
    * [Examples with Series](#Examples-with-Series)
    * [Exercises with Series](#Exercises-with-Series)
* [DataFrames](#DataFrames)
    * [DataFrame Creation](#DataFrame-Creation)
    * [Column selection, addition, deletion](#Column-selection,-addition,-deletion)
    

## Series

A `Series` is a one-dimensional structure that can hold any data type (boolean, integer, float or even Python objects).

It can represent a column of a dataset. Its main attributes are:

* **`values`** contains a `numpy.ndarray` with the `Series` values
* **`index`** allows accessing `Series` positions using integers or labels
* **`dtype`** defines the type of the objects contained in the series. Recall that all values in a dataset column share the same type
* **`size`** returns the length of the `Series` object

Some other functionalities are **vectorized operations** and `numpy.ndarray` method mapping.

There is a basic detail to keep in mind: **data alignment is intrinsic**. The link between labels and data will not be broken unless done so explicitly by you.

Operations between Series (`+`, `-`, `/`, `*`, `**`) align values based on their associated index values; the Series need not be the same length. The result index will be the sorted union of the two indexes.
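
As a minimal sketch of this alignment, the snippet below adds two Series of different lengths whose labels only partially overlap. The names `left` and `right` and their values are made up for illustration.

```python
import pandas as pd

# Two Series with different lengths and only partially overlapping labels
left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20], index=["b", "d"])

# Values are matched by label, not by position
total = left + right
print(total)
```

Only the label `b` exists in both Series, so it is the only position with a numeric result; every other label in the union is filled with `NaN`.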

### Series Creation

We can create Series from:
* [Python Dictionary](#From-dictionary)
* [Python List](#From-list)
* [NumPy `ndarray`](#From-numpy-array)
* [A scalar value](#From-scalar)

As optional parameters, we can provide an `index` and a `dtype`. If `dtype` is not specified, it will be inferred from the data.

The passed **index is a list of axis labels**. Indexes provide an effective way to access data (slicing).

Each Series has a `dtype` that corresponds to the variable type of the Series.

##### From dictionary

Basically we pass a dictionary to the constructor method. In this case, if no index is provided, it is extracted from the dictionary keys, while the data is extracted from the dictionary values.

In [3]:
d = {'a':5.,'b':5.,'c':5.}
i = ['x','y','z']
s1 = pd.Series(d)
display(s1)
display(s1.index)

a    5.0
b    5.0
c    5.0
dtype: float64

Index(['a', 'b', 'c'], dtype='object')

If index is passed, values with keys in index are pulled out, the rest are assigned to `NaN`. 

A `NaN` value means not assigned, and we will have to deal with these values in the future.

**What is NaN?**

(*from [Pandas docs](http://pandas.pydata.org/pandas-docs/stable/missing_data.html)*)
Some might quibble over our usage of missing. By “missing” we simply mean null or “not present for whatever reason”. 

Many data sets simply arrive with missing data, either because it exists and was not collected or it never existed. 

For example, in a collection of financial time series, some of the time series might start on different dates. 

Thus, values prior to the start date would generally be marked as missing.

In [6]:
d = {'a':5,'b':5,'c':5}
i = ['x','y','a','b']
s1 = pd.Series(d, index = i)

display(s1)
display(s1.dtype)
display(s1.index)

x    NaN
y    NaN
a    5.0
b    5.0
dtype: float64

dtype('float64')

Index(['x', 'y', 'a', 'b'], dtype='object')

##### From list

We can create a `Series` from a list, again by passing a `list` as argument to the `Series` constructor method. 

If no index is passed, the index will be `RangeIndex(start=0, stop=len(list), step=1)`.

In [8]:
l = [5,5,5]
s1 = pd.Series(l)
display(s1)
display(s1.index)

0    5
1    5
2    5
dtype: int64

RangeIndex(start=0, stop=3, step=1)

In this case we can provide an index if desired. However, index must have the same length as constructor list.

In [9]:
i = ['x','y','z']
s1 = pd.Series(l, index = i)
display(s1)

x    5
y    5
z    5
dtype: int64

In [5]:
# This would raise a ValueError, because the index has four labels but the list has three values
# i = ['x','y','a','b']
# s1 = pd.Series(l, index = i)
# print(s1)
display(s1.dtype)
display(s1.index)

dtype('int64')

Index(['x', 'y', 'z'], dtype='object')

##### From numpy array

Very similar to `list` case. It is useful as many Machine Learning libraries return `numpy.ndarray`.

In [34]:
s2 = pd.Series(np.array([3,20,5]),index=['a','b','c'])
display(s2)
display(s2.dtype)
display(s2.index)

a     3
b    20
c     5
dtype: int64

dtype('int64')

Index(['a', 'b', 'c'], dtype='object')

##### From scalar

We can use a scalar as an argument of the `Series` constructor method. Together with an `index` argument, this is a useful way of creating a `Series` object with a default value.

In [11]:
s3 = pd.Series(5,index=['a','b','c'])
display(s3)
display(s3.dtype)
display(s3.index)

a    5
b    5
c    5
dtype: int64

dtype('int64')

Index(['a', 'b', 'c'], dtype='object')

### Series Name
Series can have the attribute **name**. This can be seen as the name of the variable/column.

When dealing with DataFrames, each Series is automatically named after its column.

In [12]:
s3 = pd.Series(5,index=['a','b','c'], name = 'Age')
s3.name

'Age'
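
A short sketch of how the name relates to column labels, using a made-up `ages` Series: converting a named Series to a `DataFrame` uses the name as the column label, and selecting that column returns a Series with the same name.

```python
import pandas as pd

ages = pd.Series([31, 45, 27], name="Age")

# Converting a named Series to a DataFrame uses the name as column label
frame = ages.to_frame()
print(frame.columns.tolist())

# Selecting the column back returns a Series carrying the same name
print(frame["Age"].name)
```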

### Series Values Accessing

Series can be accessed through different methods:
* position (numerical index)
* boolean list
* key (axis of labels)

Accessing by position is like working with `numpy.ndarrays` or `list`, while accessing through keys (axis of labels) is like working with dictionaries.

##### Position accessing

Accessing `Series` values by its position is very similar to working with regular Python `list`.

Note that accessing a single position returns a scalar of the Series' `dtype`, while selecting more than one position returns a `Series` object.

In [13]:
display(s2)
display(type(s2))

a     3
b    20
c     5
dtype: int64

pandas.core.series.Series

In [14]:
display(s2[1])
display(type(s2[1]))

20

numpy.int64

In [15]:
display(s2[:1])
display(type(s2[:1]))

a    3
dtype: int64

pandas.core.series.Series
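
Recent pandas releases deprecate treating an integer passed to `[]` as a position when the index holds labels, so the explicit positional accessor `Series.iloc` is the safer spelling. The sketch below rebuilds a Series equivalent to `s2` under the hypothetical name `s`.

```python
import pandas as pd

s = pd.Series([3, 20, 5], index=["a", "b", "c"])

second = s.iloc[1]      # single position -> scalar
first_two = s.iloc[:2]  # slice of positions -> Series

print(second)
print(first_two)
```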

##### Boolean list accessing

This is a very powerful way of accessing values, first let's check how can we select `Series` values using a list of `Boolean` values. 

The list can be seen as a mask for the `Series` values, so it will return only the values where the position of the `Boolean` list is `True`. 

If the length of the `Boolean` list is not equal to the `Series.size` value, it will raise an `IndexError`.

In [18]:
s2[[True,True,False]]

a     3
b    20
dtype: int64
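
The length requirement can be checked with a small sketch. The Series `s` and the variable `outcome` are illustrative names, and the exact error message may differ between pandas versions.

```python
import pandas as pd

s = pd.Series([3, 20, 5], index=["a", "b", "c"])

try:
    s[[True, False]]  # the mask has 2 entries, the Series has 3
    outcome = "no error"
except IndexError as exc:
    outcome = "IndexError"
    print("IndexError:", exc)
```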

We will see later how **vectorized operations** work, but basically we can perform an operation to the whole Series.

If we use `Boolean` operations, these return `Boolean` lists that we can use to select certain values of the `Series`

In [20]:
# this is a vectorized operation
boolean_results = s2>4
display(boolean_results)
display(s2[boolean_results])

a    False
b     True
c     True
dtype: bool

b    20
c     5
dtype: int64

Normally the auxiliary variable is skipped in favor of a more compact syntax:

In [21]:
s2[s2>4]

b    20
c     5
dtype: int64

##### Key accessing

We can access `Series` values by passing as parameter a `list` containing the keys to select.

In [15]:
s2[['a','b']]

a     3
b    20
dtype: int32

In [16]:
s2['a']

3

This is very similar to position access, but every key in the list must exist in the `Series` `index`; otherwise a `KeyError` is raised. We can check whether a key exists with the `in` operator:

In [17]:
'a' in s2

True

When accessing a nonexistent key, a `KeyError` exception is raised:

In [19]:
try:
    s2['z']
except KeyError:
    print("Error handled")

Error handled


One way to avoid errors is to use the `Series.get` method, which returns a default value if the key is not found.

In [22]:
display('x' in s2)
display(s2.get('x', np.nan))

False

nan

### Series Operations

**Vectorized operations** can be performed on pandas `Series`, and most NumPy functions also accept `Series` as arguments. 

**Watch out:** The result of an operation between unaligned Series will have **the union of the indexes involved**. 

If a label is not found in one Series or the other, **the result will be marked as missing NaN**.

**Remember:** data alignment is intrinsic. The link between labels and data will not be broken unless done so explicitly by you.

In [22]:
s2

a     3
b    20
c     5
dtype: int32

The result is a `Series` object, resulting in a very functional programming syntax

In [23]:
s2 + 23

a    26
b    43
c    28
dtype: int64

In [24]:
# we can apply numpy functions
np.add(s2,23) == s2 + 23

a    True
b    True
c    True
dtype: bool

Note the index-wise operation

In [25]:
display(s1)
display(s2)
display(s1+s2)

x    5
y    5
z    5
dtype: int64

a     3
b    20
c     5
dtype: int64

a   NaN
b   NaN
c   NaN
x   NaN
y   NaN
z   NaN
dtype: float64

To solve this situation we can either use the `Series.reset_index()` method or replace the `index` attribute.
* `Series.reset_index()`: the index is set to `RangeIndex(start=0, stop=len(Series.values), step=1)` while the old index is preserved in a new column. The result therefore has a new `index`, a column named `index` holding the old index, and a column named `0` holding the values. Thus, the result is a `DataFrame`.

In [27]:
display(s2.reset_index())
display(type(s2.reset_index()))
display(s2.reset_index().index)

Unnamed: 0,index,0
0,a,3
1,b,20
2,c,5


pandas.core.frame.DataFrame

RangeIndex(start=0, stop=3, step=1)

In [42]:
s1.reset_index() + s2.reset_index()

Unnamed: 0,index,0
0,xa,8
1,yb,25
2,zc,10


* Replacing `Series.index` with a new `index`

In [43]:
s2.index = s1.index
s2 + s1

x     8
y    25
z    10
dtype: int64

### Series `dtypes`

`dtype` can be forced at `Series` creation, if `dtype` is `None`, it will be inferred. 

In [34]:
pd.Series([1,2,3,4,5], dtype=np.float32)

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float32

For **numerical** variables, most common `dtypes` will be int and float.
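
A brief sketch of inference versus explicit conversion (the variable names are illustrative): integers are inferred as an integer dtype, a single float promotes the whole Series to `float64`, and `Series.astype` converts an existing Series.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])        # inferred as an integer dtype
floats = pd.Series([1, 2.5, 3])    # one float promotes everything to float64
converted = ints.astype("float64")  # explicit conversion after creation

print(ints.dtype, floats.dtype, converted.dtype)
```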

#### Categorical `dtype` (from [Pandas User Guide: Categorical data](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html))

For categorical variables strings are common types. 

Categoricals are a pandas data type corresponding to categorical variables in statistics. 

A categorical variable takes on a limited, and usually fixed, number of possible values. 

Examples are:
* gender
* blood type
* country affiliation
* observation time

In contrast to statistical categorical variables, categorical data might have an order, for example:
* ‘strongly agree’ vs. ‘agree’
* ‘first observation’ vs. ‘second observation’
 
However, numerical operations (additions, divisions, etc.) are not possible.

All values of categorical data are either in categories or np.nan. 

Order is defined by the order of categories, not lexical order of the values. 

Internally, the data structure consists of a categories array and an integer array of codes which point to the real value in the categories array.

The categorical data type is useful in the following cases:

* A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory.
* The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order.
* As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
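
The first two cases can be sketched as follows. The data is made up, and the exact memory figures depend on the pandas version, so only the comparison between the two numbers is meaningful.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# A string column with few distinct values, repeated many times
sizes = pd.Series(["one", "two", "three"] * 1000)
as_category = sizes.astype("category")

print(sizes.memory_usage(deep=True))
print(as_category.memory_usage(deep=True))

# Ordered categories sort by their logical order, not alphabetically
logical = CategoricalDtype(categories=["one", "two", "three"], ordered=True)
ordered = pd.Series(["three", "one", "two"]).astype(logical)

print(ordered.sort_values().tolist())
print(ordered.max())
```

Sorted lexically, the values would come out as `one`, `three`, `two`; with the ordered categorical they follow the declared order instead.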

In [45]:
s = pd.Series(["a", "b", "c", "a"], dtype="category")
s

0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]

In the examples above where we passed `dtype='category'`, we used the default behavior:

* Categories are inferred from the data.
* Categories are unordered.

To control those behaviors, instead of passing `category`, use an instance of `CategoricalDtype`.

In [47]:
from pandas.api.types import CategoricalDtype

s = pd.Series(["a", "b", "c", "a"])
display(s)

cat_type = CategoricalDtype(
    categories=["b", "c", "d"],
    ordered=True)

s_cat = s.astype(cat_type)

display(s_cat)

0    a
1    b
2    c
3    a
dtype: object

0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): [b < c < d]

We can also pass a `pandas.Categorical` object to the `Series` constructor.

In [53]:
raw_cat = pd.Categorical(
    ["a", "b", "c", "a"], 
    categories=["b", "c", "d"],
    ordered=False)

s = pd.Series(raw_cat)
display(s)

0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): [b, c, d]

**Why are strings stored with dtype `object`?** (*from [StackOverflow thread](http://stackoverflow.com/questions/21018654/strings-in-a-dataframe-but-dtype-is-object)*)

The `object` dtype comes from NumPy; it describes the type of the elements in a `numpy.ndarray`. 

Every element in an `ndarray` must have the same size in bytes. For `int64` and `float64`, this is 8 bytes. 

However, the length of a string is not fixed. So instead of saving the bytes of the strings in the `ndarray` directly, pandas uses an `object` `ndarray`, which stores pointers to Python objects. This is why the `dtype` of this kind of `ndarray` is `object`.

In [35]:
pd.Series(["a","b","c"],dtype=str)

0    a
1    b
2    c
dtype: object

**Are dates categorical or numeric?**

Days of the month, months, days of the week, etc... are considered categorical.

Specific dates, such as the day payments are received, birth dates, etc... are numeric.

In [52]:
s_days1 = pd.Series(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], dtype="category")
display(s_days1)
raw_cat = pd.Categorical(
    ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], 
    categories=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'],
    ordered=True)

s_days2 = pd.Series(raw_cat)
display(s_days2)

0       Monday
1      Tuesday
2    Wednesday
3     Thursday
4       Friday
5     Saturday
6       Sunday
dtype: category
Categories (7, object): [Friday, Monday, Saturday, Sunday, Thursday, Tuesday, Wednesday]

0       Monday
1      Tuesday
2    Wednesday
3     Thursday
4       Friday
5     Saturday
6       Sunday
dtype: category
Categories (7, object): [Monday < Tuesday < Wednesday < Thursday < Friday < Saturday < Sunday]

In [55]:
s_days1.cat.ordered

False

In [56]:
s_days2.cat.ordered

True

`pandas` contains extensive capabilities and features for working with time series data for all domains. 

Using the `NumPy.datetime64` and `timedelta64` dtypes, pandas has consolidated a large number of features from other Python libraries like scikits.timeseries as well as created a tremendous amount of new functionality for manipulating time series data.

For more info: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html

In [59]:
import datetime

base = datetime.datetime.today()
date_list = [base - datetime.timedelta(days=x) for x in range(0, 7)]
date_s = pd.Series(date_list)
display(date_s)
display(date_s[1] > date_s[2])

0   2019-11-18 11:23:12.974935
1   2019-11-17 11:23:12.974935
2   2019-11-16 11:23:12.974935
3   2019-11-15 11:23:12.974935
4   2019-11-14 11:23:12.974935
5   2019-11-13 11:23:12.974935
6   2019-11-12 11:23:12.974935
dtype: datetime64[ns]

True
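
As a small sketch of both views of dates, the snippet below parses fixed date strings with `pd.to_datetime`, subtracts two of them as numeric-like values, and derives the weekday names that would typically be treated as categorical. The dates and variable names are chosen for illustration only.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2019-11-18", "2019-11-17", "2019-11-16"]))
print(dates.dtype)

# Specific dates behave like numbers: they can be compared and subtracted
gap = dates[0] - dates[2]
print(gap)

# The day of the week is a derived, categorical-like variable
weekdays = dates.dt.day_name()
print(weekdays.tolist())
```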

##### Missing Data

You can insert missing values by simply assigning to containers. The actual missing value used will be chosen based on the `dtype`.

For example, numeric containers will always use `NaN` regardless of the missing value type chosen:

In [63]:
import numpy as np

s = pd.Series(["a", "b", "c"])
s.loc[0] = None
s.loc[1] = np.nan

display(s)

0    None
1     NaN
2       c
dtype: object

In [62]:
s = pd.Series([1, 2, 3])
s.loc[0] = None
s.loc[1] = np.nan

display(s)

0    NaN
1    NaN
2    3.0
dtype: float64

Missing values propagate naturally through arithmetic operations between pandas objects.

In [37]:
s1 = pd.Series([1,2,3])
s2 = pd.Series([1,np.nan,3])

s1 + s2

0    2.0
1    NaN
2    6.0
dtype: float64

The descriptive statistics and computational methods are all written to account for missing data. For example:

* When summing data, NA (missing) values are treated as zero
* If the data are all NA, `sum()` returns 0 in recent pandas versions (older versions returned NA); pass `min_count=1` to get NA instead
* Methods like `cumsum` and `cumprod` ignore NA values, but preserve them in the resulting arrays
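
These rules can be checked with a short sketch. The Series names are illustrative, and the all-NA behaviour shown is the one of recent pandas versions, where `min_count` controls whether an all-NA sum yields `0` or `NaN`.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 3])

print(s.sum())     # NaN is skipped
print(s.mean())    # mean of the two non-missing values
print(s.cumsum())  # NaN stays in place, the running total continues after it

all_missing = pd.Series([np.nan, np.nan])
print(all_missing.sum())             # no valid values to add up
print(all_missing.sum(min_count=1))  # require at least one valid value
```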


##### Cleaning/filling missing data
    
pandas objects are equipped with various data manipulation methods for dealing with missing data.

* The `fillna()` method can “fill in” NA values with non-null data 

In [66]:
s2 = pd.Series([1,np.nan,3])
display(s2.fillna(0))

0    1.0
1    0.0
2    3.0
dtype: float64

* With `dropna()` you can simply exclude labels from a data set which refer to missing data

In [67]:
display(s2.dropna())

0    1.0
2    3.0
dtype: float64

* With `isnull()` and `notnull()` you can return a boolean array to select not assigned values

In [68]:
display(s2.isnull())

0    False
1     True
2    False
dtype: bool

In [69]:
s2[s2.isnull()]

1   NaN
dtype: float64

### Series Methods

You can check the full list of attributes and methods of the `Series` object [here](https://pandas.pydata.org/pandas-docs/stable/reference/series.html).

But, for now, let's take a look at some of the most interesting and useful ones.

In [188]:
s = pd.Series(["A", "b", "c", "d", "a", "B", "B"])

##### `Series.head(n=5)`
It returns the first `n` elements of the `Series` object. By default `n=5`. An analogous method is `Series.tail()`

In [189]:
display(s.head(3))
display(s.tail(3))

0    A
1    b
2    c
dtype: object

4    a
5    B
6    B
dtype: object

##### `Series.unique()`
It returns the unique (non-repeated) values as a `numpy.ndarray`, in order of first appearance.

In [202]:
s.unique()

array(['A', 'b', 'c', 'd', 'a', 'B'], dtype=object)

##### `Series.value_counts()`

It returns a frequency table. It has some interesting parameters like `normalize` and `dropna`.

In [74]:
display(s)
display(s.value_counts())

0    A
1    b
2    c
3    d
4    a
5    B
6    B
dtype: object

B    2
d    1
c    1
b    1
a    1
A    1
dtype: int64

##### `Series.describe()`
It returns descriptive statistics of the variable (`Series` object)

In [75]:
display(s)
display(s.describe())

0    A
1    b
2    c
3    d
4    a
5    B
6    B
dtype: object

count     7
unique    6
top       B
freq      2
dtype: object

##### `Series.map()`
It applies the function passed as a parameter to each element of the `Series` object.

In [76]:
s.map(lambda x: x.lower())

0    a
1    b
2    c
3    d
4    a
5    b
6    b
dtype: object
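
Besides a function, `Series.map` also accepts a dictionary used as a lookup table. In this sketch (with the made-up names `letters` and `lookup`), values without an entry in the dictionary become missing.

```python
import pandas as pd

letters = pd.Series(["A", "b", "c"])
lookup = {"A": "alpha", "b": "beta"}  # no entry for "c"

mapped = letters.map(lookup)
print(mapped)
```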

### Examples with Series

We can see some examples of how we can load datasets using `Series` and how to manipulate them.

In [176]:
import csv
import requests

url = 'https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat'
response = requests.get(url)

if response.status_code == 200:
    data = list(csv.reader(response.content.decode().split("\n"))) # data is a list of list
    # warning, last row is empty
    data = data[:-1]
    print(type(data))
else:
    print("error requesting")

<class 'list'>


Taking a look at the `data` variable, we can see that it is a list of lists, containing list of rows. We have seen that `Series` map to columns, so first, let's transpose the dataset.

In [177]:
transposed_dataset = list(map(list, zip(*data)))
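
The transposition idiom may be easier to follow on a tiny example. The rows below are placeholder data, not taken from the airports file.

```python
# Each inner list is one row: id, airport name, country
rows = [
    ["1", "Airport A", "Country X"],
    ["2", "Airport B", "Country X"],
]

# zip(*rows) groups the i-th element of every row, giving one tuple per column
columns = list(map(list, zip(*rows)))
print(columns)
```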

Fourth column is airport country

In [183]:
display(transposed_dataset[3][:5])
airport_countries = transposed_dataset[3]

['Papua New Guinea',
 'Papua New Guinea',
 'Papua New Guinea',
 'Papua New Guinea',
 'Papua New Guinea']

Now we can create a `Series` with this list

In [187]:
countries = pd.Series(airport_countries)
# this is a more interesting Series object than a list
countries.head(n=10)

0    Papua New Guinea
1    Papua New Guinea
2    Papua New Guinea
3    Papua New Guinea
4    Papua New Guinea
5    Papua New Guinea
6           Greenland
7           Greenland
8           Greenland
9           Greenland
dtype: object

In [184]:
display(countries.index)

RangeIndex(start=0, stop=7698, step=1)

Access by index

In [190]:
countries[0]

'Papua New Guinea'

Access using boolean arrays

In [192]:
## we can evaluate a function over all elements to get a boolean series
(countries == "Spain").head()

0    False
1    False
2    False
3    False
4    False
dtype: bool

In [193]:
countries[countries == "Spain"].head()

1028    Spain
1029    Spain
1030    Spain
1031    Spain
1032    Spain
dtype: object

In [194]:
countries[countries != "Spain"].head()

0    Papua New Guinea
1    Papua New Guinea
2    Papua New Guinea
3    Papua New Guinea
4    Papua New Guinea
dtype: object

In [195]:
countries[countries != "Spain"].tail()

7693     Russia
7694     Russia
7695     Russia
7696      Chile
7697    Ukraine
dtype: object

Get the list of countries that have at least 1 airport

In [199]:
countries.unique()[:5]

array(['Papua New Guinea', 'Greenland', 'Iceland', 'Canada', 'Algeria'],
      dtype=object)

Note that, as keys are integers, accessing by key has the same behaviour

In [200]:
countries[81]

'Canada'

We can get the list of countries with the most airports

In [204]:
countries.value_counts().head()

United States    1512
Canada            430
Australia         334
Brazil            264
Russia            264
dtype: int64

In [205]:
countries.value_counts(normalize=True).head()

United States    0.196415
Canada           0.055859
Australia        0.043388
Brazil           0.034295
Russia           0.034295
dtype: float64

### Exercises with Series

1. Load iqsize.csv using the `csv` library as a dictionary of Series.



2. Check Series `dtype`. Are they correct? Why? 



3. For sex variable, select those that are males (including synonyms). 



4. Count how many women and how many men there are. How many missing elements are there?

## DataFrames

A DataFrame is a 2-dimensional **labeled** data structure with columns of different types. 

It can be seen as a spreadsheet, where columns are `Series` or a Python dictionary where Series can be accessed through labels.

Note that now the indexing is two dimensional:
* `index` indexes rows
* `columns` indexes columns

`DataFrame` has the `dtypes` attribute, which is a `Series` object containing the `dtype` of each column.

The shape of a `DataFrame` is stored in `DataFrame.shape` attribute, which is a tuple of ints: `(n_rows, n_cols)`

### DataFrame Creation

We can create DataFrames from:
* [Dict of Series](#From-dict-of-Series-or-dict)
* [Dict of `ndarrays` or lists](#From-dict-of-ndarrays-/-lists)
* [Structured or record ndarray](#From-structured-or-record-array)
* [List of Dicts](#From-a-list-of-dicts)
* [Another `DataFrame`](#From-another-DataFrame)

##### From dict of Series

We can use a dict as constructor of the `DataFrame` object.

Dict keys will be used as `column` names.

The result **index** will be the **union** of the indexes of the various Series. 

In [261]:
d = {'one': pd.Series([1,2,3], index=['a','b','c']),
     'two': pd.Series([1,2,3,4], index=['a','b','c','z']),
     'three':{'a':1}}
df = pd.DataFrame(d)
display(df.index)
display(df.columns)
display(df.dtypes)
display(df.shape)

Index(['a', 'b', 'c', 'z'], dtype='object')

Index(['one', 'two', 'three'], dtype='object')

one      float64
two        int64
three    float64
dtype: object

(4, 3)

In [260]:
display(df)

Unnamed: 0,one,two,three
a,1.0,1,1.0
b,2.0,2,
c,3.0,3,
z,,4,


If `index` is passed as a parameter, only those index labels of the `Series` will be used to construct the `DataFrame`:

In [44]:
pd.DataFrame(d, index=['d', 'b', 'a'])

Unnamed: 0,one,three,two
d,,,
b,2.0,,2.0
a,1.0,1.0,1.0


The same behaviour works for columns.

In [45]:
pd.DataFrame(d, index=['d', 'b', 'a'], 
             columns=['two', 'three','four'])

Unnamed: 0,two,three,four
d,,,
b,2.0,,
a,1.0,1.0,


##### From dict of `ndarrays` or lists

The `ndarrays` must all be the same length. 

Dictionary `keys` will be used as `column` index.

If an **index** is passed, it must be the same length as the arrays. 

If no **index** is passed, the result will be `range(n)`, where `n` is the array length.

In [213]:
d = {'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.]}
pd.DataFrame(d)

Unnamed: 0,one,two
0,1.0,4.0
1,2.0,3.0
2,3.0,2.0
3,4.0,1.0


**Important Note:** as in `Series`, operations are index aligned

In [214]:
d1 = pd.DataFrame(d) 
d2 = pd.DataFrame(d, index = ['a','b','c','d'])
d1+d2

Unnamed: 0,one,two
0,,
1,,
2,,
3,,
a,,
b,,
c,,
d,,


##### From structured or record array

We can also build a `DataFrame` from a NumPy structured array, that is, an `ndarray` created with a compound `dtype`.

In [221]:
data = np.zeros((2,), dtype=[('A', 'i4'),('B', 'f4'),('C', 'S10')])
data

array([(0, 0., b''), (0, 0., b'')],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])

In the example above we create a one-dimensional array with two records. The number of fields (columns), their names and their types are all specified through the `dtype` parameter.

We pass a list of three tuples. Each tuple contains a field name and a `dtype` given as a string (see [Arrays Interface](https://docs.scipy.org/doc/numpy/reference/arrays.interface.html#arrays-interface) for further reference):
* `i4`: 32 bit integer
* `f4`: 32 bit float
* `S10`: zero-terminated bytes of length 10 (older NumPy versions also accepted the alias `a10`, which NumPy 2.0 removed)

Then we fill the created array with values:

In [223]:
data[:] = [(1,2.,'Hello'), (2,3.,"World")]
display(data)

array([(1, 2., b'Hello'), (2, 3., b'World')],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])

Now we can construct the `DataFrame`

In [224]:
df = pd.DataFrame(data)
display(df)
display(df.dtypes)

Unnamed: 0,A,B,C
0,1,2.0,b'Hello'
1,2,3.0,b'World'


A      int32
B    float32
C     object
dtype: object

In this case we can set row names using `index`

In [225]:
pd.DataFrame(data, index=['first', 'second'])

Unnamed: 0,A,B,C
first,1,2.0,b'Hello'
second,2,3.0,b'World'


Or reorder the columns:

In [226]:
pd.DataFrame(data, columns=['C', 'A', 'B'])

Unnamed: 0,C,A,B
0,b'Hello',1,2.0
1,b'World',2,3.0


##### From a list of dicts

When using a list of dicts, each dict becomes a row, with its keys used as column names and its values as the data.

In [229]:
data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
pd.DataFrame(data2)

Unnamed: 0,a,b,c
0,1,2,
1,5,10,20.0


As in the previous cases, we can pass `index` to set the row names, and `columns` to select which keys are used as columns:

In [230]:
pd.DataFrame(data2, index=['first', 'second'])

Unnamed: 0,a,b,c
first,1,2,
second,5,10,20.0


In [231]:
pd.DataFrame(data2, columns=['a', 'b'])

Unnamed: 0,a,b
0,1,2
1,5,10


##### From another `DataFrame`


We will see how to select data from other `DataFrames`, but it can be useful to slice a `DataFrame` to create a new `DataFrame`

In [232]:
pd.DataFrame(df)

Unnamed: 0,A,B,C
0,1,2.0,b'Hello'
1,2,3.0,b'World'


There are other ways of constructing `DataFrames`, for further reference, visit: [Alternate constructors](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe)
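
One of those alternate constructors is `DataFrame.from_dict`. The sketch below uses made-up data and `orient="index"`, so that the outer dictionary keys become row labels instead of column names.

```python
import pandas as pd

records = {"first": {"a": 1, "b": 2}, "second": {"a": 5, "b": 10}}

# orient="index" treats each outer key as a row label
by_rows = pd.DataFrame.from_dict(records, orient="index")
print(by_rows)
```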

### Column selection, addition, deletion
*(from [Pandas Docs](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#column-selection-addition-deletion))*

You can treat a `DataFrame` semantically like a dict of like-indexed `Series` objects. 

Getting, setting, and deleting columns works with the same syntax as the analogous dict operations:

In [235]:
import pandas as pd
df = pd.DataFrame({"one":[1.6,2.2,3.4,3.5],"two":[1.5,2.1,3.9,np.nan],"three":[1.2,2.80,3.80,np.nan]})
df

Unnamed: 0,one,two,three
0,1.6,1.5,1.2
1,2.2,2.1,2.8
2,3.4,3.9,3.8
3,3.5,,


In [238]:
df['four'] = df['one'] * df['two']
df['flag'] = df['one'] > 2
df

Unnamed: 0,one,two,three,four,flag
0,1.6,1.5,1.2,2.4,False
1,2.2,2.1,2.8,4.62,True
2,3.4,3.9,3.8,13.26,True
3,3.5,,,,True


Note that using this syntax, the result will be a `Series` object

In [239]:
type(df['one'])

pandas.core.series.Series

Columns can be deleted or popped like with a dict:

In [240]:
del df['two']
three = df.pop('three')
df

Unnamed: 0,one,four,flag
0,1.6,2.4,False
1,2.2,4.62,True
2,3.4,13.26,True
3,3.5,,True


When inserting a scalar value, it will naturally be propagated to fill the column:

In [241]:
df['foo'] = 'bar'
df

Unnamed: 0,one,four,flag,foo
0,1.6,2.4,False,bar
1,2.2,4.62,True,bar
2,3.4,13.26,True,bar
3,3.5,,True,bar


When inserting a `Series` that does not have the same `index` as the `DataFrame`, it will be conformed to the `DataFrame`’s `index`:

In [60]:
df['one_trunc'] = df['one'][:2]
display(df)
# remove the helper column again so the following examples are unaffected
del df['one_trunc']

Unnamed: 0,one,four,flag,foo,one_trunc
0,1.6,2.4,False,bar,1.6
1,2.2,4.62,True,bar,2.2
2,3.4,13.26,True,bar,
3,3.5,,True,bar,


In [242]:
df['one'][:2]

0    1.6
1    2.2
Name: one, dtype: float64

You can insert raw `ndarrays` but their length must match the length of the `DataFrame`’s index.

By default, columns get inserted at the end. 

The `insert` method is available to insert at a particular location in the columns:

In [243]:
df.insert(1, 'rand', np.random.randint(1,10,df["one"].size))
df

Unnamed: 0,one,rand,four,flag,foo
0,1.6,5,2.4,False,bar
1,2.2,5,4.62,True,bar
2,3.4,4,13.26,True,bar
3,3.5,4,,True,bar


## `DataFrame` slicing

Cutting `DataFrames` into smaller `DataFrames` or `Series` is called slicing. 

It is a basic operation in the `pandas` library. It can be done in several ways; let's go over the most commonly used ones.

##### Square bracket syntax
We can pass a column name, a row slice or a Boolean list:
* `df[col]`: Select column *(returns Series)*
* `df[5:10]`: Slice rows *(returns DataFrame)*
* `df[bool_vec]`: Select rows by boolean vector *(returns DataFrame)*

In [244]:
df

Unnamed: 0,one,rand,four,flag,foo
0,1.6,5,2.4,False,bar
1,2.2,5,4.62,True,bar
2,3.4,4,13.26,True,bar
3,3.5,4,,True,bar


In [246]:
# Select column
df["one"]

0    1.6
1    2.2
2    3.4
3    3.5
Name: one, dtype: float64

In [247]:
# Select row by integer location
df[2:4]

Unnamed: 0,one,rand,four,flag,foo
2,3.4,4,13.26,True,bar
3,3.5,4,,True,bar


In [248]:
# Select rows by boolean vector
df[[True,True,True,False]]

Unnamed: 0,one,rand,four,flag,foo
0,1.6,5,2.4,False,bar
1,2.2,5,4.62,True,bar
2,3.4,4,13.26,True,bar


##### DataFrame.loc[`rows`, `columns`] attribute:
* It expects index labels for `rows` and `columns`
* We can use the syntactic sugar `:` to select all `rows` or `columns`

In [None]:
# Slice the whole DF
print(df.loc[:,:])

In [249]:
# Slice one row
print(df.loc[1,:])


one      2.2
rand       5
four    4.62
flag    True
foo      bar
Name: 1, dtype: object


In [250]:
# Slice one column
df.loc[:,"one"]

0    1.6
1    2.2
2    3.4
3    3.5
Name: one, dtype: float64

In [252]:
# Slice two columns
df.loc[:,["one","rand"]]

Unnamed: 0,one,rand
0,1.6,5
1,2.2,5
2,3.4,4
3,3.5,4


In [251]:
# Slice two columns and two rows
df.loc[[1,2],["one","rand"]]

Unnamed: 0,one,rand
1,2.2,5
2,3.4,4


##### DataFrame.iloc[`rows`, `columns`] attribute:
* It expects integer positions (starting at 0) for `rows` and `columns`
* We can use the syntactic sugar `:` to select all `rows` or `columns`

In [253]:
# Slice the whole DF
print(df.iloc[:,:])

   one  rand   four   flag  foo
0  1.6     5   2.40  False  bar
1  2.2     5   4.62   True  bar
2  3.4     4  13.26   True  bar
3  3.5     4    NaN   True  bar


In [254]:
# Slice one row
df.iloc[1,:]

one      2.2
rand       5
four    4.62
flag    True
foo      bar
Name: 1, dtype: object

In [255]:
# Slice one column
df.iloc[:,1]

0    5
1    5
2    4
3    4
Name: rand, dtype: int64

In [256]:
# Slice two columns
df.iloc[:,[1,2]]

Unnamed: 0,rand,four
0,5,2.4
1,5,4.62
2,4,13.26
3,4,


In [257]:
# Slice two columns and two rows
df.iloc[[1,2],[1,2]]

Unnamed: 0,rand,four
1,5,4.62
2,4,13.26
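
One difference between `.loc` and `.iloc` that is easy to miss: a slice in `.iloc` excludes the stop position (like a Python list), while a slice in `.loc` includes the stop label. A small sketch on a made-up frame (`demo` and `shifted` are hypothetical names):

```python
import pandas as pd

# Hypothetical frame with the default index 0..3
demo = pd.DataFrame({"one": [1.6, 2.2, 3.4, 3.5]})

by_position = demo.iloc[1:3]   # positions 1 and 2, stop excluded
by_label = demo.loc[1:3]       # labels 1, 2 and 3, stop included

# With a non-default index, positions and labels no longer coincide
shifted = demo.copy()
shifted.index = [10, 20, 30, 40]
first_by_position = shifted.iloc[0]   # first row, whatever its label is
first_by_label = shifted.loc[10]      # row whose label is 10

print(len(by_position), len(by_label))
```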


### Loading DataFrames

We will focus on three formats to read and store our data on disk:
* **CSV:** comma-separated values. Two standard separators:
    * comma: American
    * semicolon: European
* **XLS:** Excel file (xls or xlsx)
* **Pickle:** Python serialized file format

Pandas provides functions to read and write CSV, Excel and pickle files:
* Reading:
  * `pandas.read_csv(...)`: [API Reference](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
  * `pandas.read_excel(...)`: [API Reference](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html)
  * `pandas.read_pickle(...)`: [API Reference](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_pickle.html)
* Writing:
  * `object.to_csv(...)`: [API Reference](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html)
  * `object.to_excel(...)`: [API Reference](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_excel.html)
  * `object.to_pickle(...)`: [API Reference](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_pickle.html)
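
The two separator conventions can be handled with the `sep` and `decimal` parameters of `read_csv` and `to_csv`. The following is a sketch on made-up, in-memory content (`raw` and `people` are hypothetical), assuming a semicolon-separated file that uses a comma as decimal mark:

```python
import io
import pandas as pd

# Hypothetical file content: semicolon as separator, comma as decimal mark
raw = "name;height\nAnna;1,70\nBen;1,82\n"

# io.StringIO lets read_csv treat the string as if it were a file
people = pd.read_csv(io.StringIO(raw), sep=";", decimal=",")

# Write it back in the same convention; index=False skips the row labels
out = people.to_csv(sep=";", decimal=",", index=False)

print(people.dtypes)
print(out)
```

Without `decimal=","` the `height` column would be read as text rather than as numbers.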
  

##### From CSV

In [None]:
iqsize = pd.read_csv("https://raw.githubusercontent.com/f-guitart/data_mining/master/data/iqsize.csv")
iqsize.head()

In [None]:
type(iqsize)

In [None]:
iqsize["sex"][:10]

In [None]:
iqsize["sex"].to_csv("myseries.csv")
%ls myseries.csv

##### From XLSX

With Excel files we can start the other way round: writing first.

In [None]:
iqsize.to_excel("iqsize.xlsx")
%ls iqsize.xlsx

In [None]:
xls_iqsize = pd.read_excel("iqsize.xlsx")
xls_iqsize.head()

##### From Pickle
Why do we need serialized writing features?

Because sometimes we want to store weird things:
* DataFrames with dictionaries, lists or objects in columns
* Dictionaries of dataframes

In [None]:
my_df = pd.DataFrame({"a" : [{"apples": [1,2,3,4,6], "pears":2}, None, None, {"bannanas":4}],
                     "b" : [0,1,2,3]})
my_df.to_csv("mydf.csv")

In [None]:
my_df2 = pd.read_csv("mydf.csv")
# Column 0 is the saved index, so column 1 is "a": the dict comes back as a plain string
type(my_df2.iloc[0,1])
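
The CSV round trip loses the Python type because CSV only stores text. The sketch below reproduces this in memory on a made-up frame (`original`, `restored` and `parsed` are hypothetical names) and shows that getting the dictionary back needs an extra, manual parsing step:

```python
import ast
import io
import pandas as pd

# Hypothetical frame with a dictionary stored in a cell
original = pd.DataFrame({"a": [{"apples": [1, 2], "pears": 2}], "b": [0]})

# Round trip through CSV text, kept in memory instead of on disk
csv_text = original.to_csv(index=False)
restored = pd.read_csv(io.StringIO(csv_text))

print(type(original.loc[0, "a"]))   # the original cell is a dict
print(type(restored.loc[0, "a"]))   # the restored cell is a str

# The text can be parsed back by hand, one cell at a time
parsed = ast.literal_eval(restored.loc[0, "a"])
print(parsed == original.loc[0, "a"])
```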

In [None]:
my_df.to_pickle("mydf.pickle")
my_df3 = pd.read_pickle("mydf.pickle")
my_df3.head()

In [None]:
type(my_df3.iloc[0,0])

In [None]:
train = pd.Series([1,2,3,4,5,6,7,8])
test = pd.Series([9,10,11])

pd.to_pickle({"train": train,
             "test" : test},"my_pickle.pickle")
%ls my_pickle.pickle

In [None]:
my_pickle = pd.read_pickle("my_pickle.pickle")
my_pickle.keys()

In [None]:
type(my_pickle['train'])
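
The same round trip can be checked without leaving files behind. This sketch writes to a temporary directory; `bundle` and the file name are made up for illustration.

```python
import os
import tempfile
import pandas as pd

# Hypothetical dictionary of Series
bundle = {"train": pd.Series([1, 2, 3]), "test": pd.Series([9, 10])}

# The temporary directory and its content are removed when the block ends
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bundle.pickle")
    pd.to_pickle(bundle, path)
    loaded = pd.read_pickle(path)

# Keys and values come back with their original Python types
print(sorted(loaded.keys()))
print(type(loaded["train"]))
```

Pickle files can execute arbitrary code when loaded, so `read_pickle` should only be used on files from a trusted source.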

### Examples with DataFrames

We will go through these examples using a real dataset. 

Remember that in the Series examples we loaded a CSV file.

We did it using the `csv` library; however, Pandas provides the necessary tools to read a CSV file and output a DataFrame.

Let's see an example using [OpenFlights](http://openflights.org/data.html) data.

The data is structured as follows:

* **Airport ID:** Unique OpenFlights identifier for this airport.
* **Name:** Name of airport. May or may not contain the City name.
* **City:** Main city served by airport. May be spelled differently from Name.
* **Country:** Country or territory where airport is located.
* **IATA/FAA:** 3-letter FAA code, for airports located in Country "United States of America". 3-letter IATA code, for all other airports. Blank if not assigned.
* **ICAO:** 4-letter ICAO code. Blank if not assigned.
* **Latitude:** Decimal degrees, usually to six significant digits. Negative is South, positive is North.
* **Longitude:** Decimal degrees, usually to six significant digits. Negative is West, positive is East.
* **Altitude:** In feet.
* **Timezone:** Hours offset from UTC. Fractional hours are expressed as decimals, e.g. India is 5.5.
* **DST:** Daylight savings time. One of E (Europe), A (US/Canada), S (South America), O (Australia), Z (New Zealand), N (None) or U (Unknown).
* **Tz database time zone:** Timezone in "tz" (Olson) format, e.g. "America/Los_Angeles".

In [297]:
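import pandas as pd
import requests
import io

url = 'https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat'
# Download the raw file content as bytes
response = requests.get(url).content
# The file has no header row. Each row has 14 fields but only 13 names are given,
# so pandas uses the first field (Airport ID) as the index.
# Note: with these names, "DST" holds the UTC offset, "unk" the DST code
# and "Timezone" the tz database name (see the output).
head = ["Name", "City", "Country",  "IATA/FAA", "ICAO", "Latitude", "Longitude", 
        "Altitude", "DST", "unk", "Timezone", "Type", "Source"]
data_frame = pd.read_csv(io.BytesIO(response), names=head)
data_frame.head()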

Unnamed: 0,Name,City,Country,IATA/FAA,ICAO,Latitude,Longitude,Altitude,DST,unk,Timezone,Type,Source
1,Goroka Airport,Goroka,Papua New Guinea,GKA,AYGA,-6.08169,145.391998,5282,10,U,Pacific/Port_Moresby,airport,OurAirports
2,Madang Airport,Madang,Papua New Guinea,MAG,AYMD,-5.20708,145.789001,20,10,U,Pacific/Port_Moresby,airport,OurAirports
3,Mount Hagen Kagamuga Airport,Mount Hagen,Papua New Guinea,HGU,AYMH,-5.82679,144.296005,5388,10,U,Pacific/Port_Moresby,airport,OurAirports
4,Nadzab Airport,Nadzab,Papua New Guinea,LAE,AYNZ,-6.569803,146.725977,239,10,U,Pacific/Port_Moresby,airport,OurAirports
5,Port Moresby Jacksons International Airport,Port Moresby,Papua New Guinea,POM,AYPY,-9.44338,147.220001,146,10,U,Pacific/Port_Moresby,airport,OurAirports


In [298]:
data_frame.dtypes

Name          object
City          object
Country       object
IATA/FAA      object
ICAO          object
Latitude     float64
Longitude    float64
Altitude       int64
DST           object
unk           object
Timezone      object
Type          object
Source        object
dtype: object
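
Several columns are reported as `object` even though they look numeric. A likely cause, stated here as an assumption about the file rather than something shown in the output, is that missing values are written as `\N`. The sketch below uses two in-memory rows in the same layout (the second row is invented) and names all 14 fields, so that the ID becomes an explicit index and `\N` is read as missing:

```python
import io
import pandas as pd

# Two rows in the airports.dat layout; the second one is made up
raw = (
    '1,"Goroka Airport","Goroka","Papua New Guinea","GKA","AYGA",'
    '-6.08169,145.391998,5282,10,"U","Pacific/Port_Moresby","airport","OurAirports"\n'
    '2,"Example Strip","Nowhere","Nowhere Land",\\N,"XXXX",'
    '1.5,2.5,100,\\N,\\N,\\N,"airport","OurAirports"\n'
)

# One name per field, following the field list above plus Type and Source
columns = ["Airport ID", "Name", "City", "Country", "IATA/FAA", "ICAO",
           "Latitude", "Longitude", "Altitude", "Timezone", "DST",
           "Tz database time zone", "Type", "Source"]

airports = pd.read_csv(
    io.StringIO(raw),
    names=columns,            # the data has no header row
    index_col="Airport ID",   # make the index choice explicit
    na_values=["\\N"],        # assumption: \N marks a missing value
)

print(airports.dtypes)
```

With the missing marker declared, the `Timezone` column in this sketch is parsed as a float column instead of `object`.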

In [299]:
data_frame["Name"].head()

1                                 Goroka Airport
2                                 Madang Airport
3                   Mount Hagen Kagamuga Airport
4                                 Nadzab Airport
5    Port Moresby Jacksons International Airport
Name: Name, dtype: object

In [300]:
(data_frame["Name"] + data_frame["ICAO"]).head()

1                                 Goroka AirportAYGA
2                                 Madang AirportAYMD
3                   Mount Hagen Kagamuga AirportAYMH
4                                 Nadzab AirportAYNZ
5    Port Moresby Jacksons International AirportAYPY
dtype: object

In [301]:
data_frame["Altitude (m)"] = (data_frame["Altitude"] * 0.3048)
data_frame["Seaside"] = data_frame["Altitude (m)"] < 20
data_frame[data_frame["Seaside"]].head()

Unnamed: 0,Name,City,Country,IATA/FAA,ICAO,Latitude,Longitude,Altitude,DST,unk,Timezone,Type,Source,Altitude (m),Seaside
2,Madang Airport,Madang,Papua New Guinea,MAG,AYMD,-5.20708,145.789001,20,10,U,Pacific/Port_Moresby,airport,OurAirports,6.096,True
6,Wewak International Airport,Wewak,Papua New Guinea,WWK,AYWK,-3.58383,143.669006,19,10,U,Pacific/Port_Moresby,airport,OurAirports,5.7912,True
11,Akureyri Airport,Akureyri,Iceland,AEY,BIAR,65.660004,-18.072701,6,0,N,Atlantic/Reykjavik,airport,OurAirports,1.8288,True
13,Hornafjörður Airport,Hofn,Iceland,HFN,BIHN,64.295601,-15.2272,24,0,N,Atlantic/Reykjavik,airport,OurAirports,7.3152,True
14,Húsavík Airport,Husavik,Iceland,HZK,BIHU,65.952301,-17.426001,48,0,N,Atlantic/Reykjavik,airport,OurAirports,14.6304,True


In [302]:
# Columns can be deleted or popped like with a dict:
del data_frame["Altitude"]
seaside = data_frame.pop('Seaside')
print(seaside.head())
data_frame.head()

1    False
2     True
3    False
4    False
5    False
Name: Seaside, dtype: bool


Unnamed: 0,Name,City,Country,IATA/FAA,ICAO,Latitude,Longitude,DST,unk,Timezone,Type,Source,Altitude (m)
1,Goroka Airport,Goroka,Papua New Guinea,GKA,AYGA,-6.08169,145.391998,10,U,Pacific/Port_Moresby,airport,OurAirports,1609.9536
2,Madang Airport,Madang,Papua New Guinea,MAG,AYMD,-5.20708,145.789001,10,U,Pacific/Port_Moresby,airport,OurAirports,6.096
3,Mount Hagen Kagamuga Airport,Mount Hagen,Papua New Guinea,HGU,AYMH,-5.82679,144.296005,10,U,Pacific/Port_Moresby,airport,OurAirports,1642.2624
4,Nadzab Airport,Nadzab,Papua New Guinea,LAE,AYNZ,-6.569803,146.725977,10,U,Pacific/Port_Moresby,airport,OurAirports,72.8472
5,Port Moresby Jacksons International Airport,Port Moresby,Papua New Guinea,POM,AYPY,-9.44338,147.220001,10,U,Pacific/Port_Moresby,airport,OurAirports,44.5008


In [303]:
# When inserting a scalar value, it will naturally be propagated to fill the column:
data_frame["Infrastructure"] = "Airport" 
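
The column operations above can be tried on a small frame without downloading anything. The sketch below uses made-up data (`demo` is hypothetical) and also shows `drop`, which, unlike `del` and `pop`, returns a new frame and leaves the original untouched:

```python
import pandas as pd

# Hypothetical frame; altitude in feet
demo = pd.DataFrame({"Name": ["A", "B"], "Altitude": [20, 5282]})

# Derived column: feet to metres
demo["Altitude (m)"] = demo["Altitude"] * 0.3048

# A scalar is repeated for every row
demo["Infrastructure"] = "Airport"

# drop returns a new frame without the column; demo keeps it
trimmed = demo.drop(columns=["Altitude"])

# pop removes the column in place and returns it as a Series
popped = demo.pop("Infrastructure")

print(list(trimmed.columns))
print(list(demo.columns))
```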

## Exercises

**Exercise 1:** Load iqsize.csv using the `csv` library. The result should be a DataFrame.

**Exercise 2:** Identify dataset variables. Resolve type problems. Can you change types?

**Exercise 3:** Check the range of the quantitative variables. Are they correct? If not, how would you correct them? (Don't spend much time on this.)

**Exercise 4:** Check the labels of the qualitative variables. Are they correct? If not, how would you correct them?

**Exercise 5:** For quantitative variables, compute the mean and median.

**Exercise 6:** For qualitative variables, count how many observations of each label exist.

**Exercise 7:** Compute the statistics from the previous exercises (5 and 6), but now for each label in the Sex variable.

## References
* Pandas API Reference: https://pandas.pydata.org/pandas-docs/stable/reference/index.html
* Getting Started with Pandas: https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html