In [39]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import numpy as np
import pandas as pd


# Introduction to Python *-* pandas

---

<br>
EDAA

Albert Ruiz

## Agenda

* Introduction to pd.Series
* Introduction to pd.DataFrame
* Essential functionality
* Summarizing and descriptive statistics
* Loading and storage

<h1 class="center_text">Introduction to pd.Series</h1>

## What is a pd.Series?

A series is a one-dimensional array-like object containing a sequence of *values* and an associated array of *labels* used as index.

In [40]:
# Default index
obj = pd.Series([10, 20, 30])
obj

# Custom labels
obj = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
obj

# Labels can be numbers and values can be strings too
obj = pd.Series(['a', 'b', 'c'], index=[10, 20, 30])
obj

0    10
1    20
2    30
dtype: int64

a    10
b    20
c    30
dtype: int64

10    a
20    b
30    c
dtype: object

## Basic selection

Unlike NumPy arrays, a Series also lets you select values by label.

In [41]:
# Default indexes
obj = pd.Series([10, 20, 30])

# Single value
f"Accessing a single value: {obj[2]}"

# Set of values
f"Accessing a set of values:"
obj[1:3]

# Custom labels
obj = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Single value
f"Accessing a single value by label: {obj['c']}"

# Set of values
f"Accessing a set of values by labels:"
obj[['b', 'c']]


'Accessing a single value: 30'

'Accessing a set of values:'

1    20
2    30
dtype: int64

'Accessing a single value by label: 30'

'Accessing a set of values by labels:'

b    20
c    30
dtype: int64

## pd.Series attributes

All series have the following attributes:

* `dtype` - Return the dtype object of the underlying data.
* `hasnans` - Return True if there are any NaNs; enables various performance speedups.
* `iat` - Access a single value for a row/column pair by integer position.
* `index` - The index (labels) of the Series.
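* `is_monotonic_increasing` - Return True if values in the object are monotonically increasing (older pandas versions also offered the alias `is_monotonic`, removed in pandas 2.0).
* `is_monotonic_decreasing` - Return True if values in the object are monotonically decreasing.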
* `is_unique` - Return True if values in the object are unique.
* `loc` - Access a group of rows and columns by label(s) or a boolean array.
* `ndim` - Number of dimensions of the underlying data.
* `shape` - Return a tuple of the shape of the underlying data.
* `size` - Return the number of elements in the underlying data.
* `values` - Return Series as ndarray or ndarray-like depending on the dtype.

You can find the full list of attributes in this [link](https://pandas.pydata.org/docs/reference/api/pandas.Series.html).

## Using pd.Series attributes

In [42]:
obj = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Value and index
f"Values: {obj.values}"
f"Type: {type(obj.values)}"
f"Indexes: {obj.index}"

# Accessing
f"Accessing by integer position: {obj.iat[2]}"
f"Accessing by label: {obj.loc['c']}"

# Monotonic (is_monotonic was removed in pandas 2.0)
f"Is monotonic: {obj.is_monotonic_increasing}"

# Unique
f"Is unique: {obj.is_unique}"

# Dimension, shape and size
f"Number of dimensions: {obj.ndim}"
f"Shape: {obj.shape}"
f"Size: {obj.size}"

'Values: [10 20 30]'

"Type: <class 'numpy.ndarray'>"

"Indexes: Index(['a', 'b', 'c'], dtype='object')"

'Accessing by integer position: 30'

'Accessing by label: 30'

'Is monotonic: True'

'Is unique: True'

'Number of dimensions: 1'

'Shape: (3,)'

'Size: 3'

## pd.Series methods

All series have the following methods:

* `abs()` - Return a Series with absolute numeric value of each element.
* `add()` - Add value to series, element-wise.
* `mul()` and `div()` - Multiply/Divide by value or series, element wise.
* `pow()` - Return exponential power of series and value or series.
* `all()` - Return whether all elements are True.
* `any()` - Return whether any elements are True.
* `append()` - Concatenate series (removed in pandas 2.0; use `pd.concat()` instead).
* `argmax()` and `argmin()` - Return the int position of the largest/smallest value.
* `max()` and `min()` - Return the maximum/minimum value.
* `sum()` - Return the sum of the values.
* `mean()` and `median()` - Return the mean and the median of the values.

You can find the full list of methods in this [link](https://pandas.pydata.org/docs/reference/api/pandas.Series.html).
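
Some of the methods listed above can be sketched on a small series. The values and labels here are made up for illustration, and `pd.concat()` is used for concatenation because it works across pandas versions:

```python
import pandas as pd

s = pd.Series([-1, 5, -2], index=["a", "b", "c"])

# Absolute value of each element
abs_s = s.abs()                  # 1, 5, 2

# Element-wise addition and power
plus_ten = s.add(10)             # 9, 15, 8
squared = s.pow(2)               # 1, 25, 4

# Boolean reductions over a condition
any_negative = (s < 0).any()     # True
all_negative = (s < 0).all()     # False

# Concatenating two series
both = pd.concat([s, pd.Series([7], index=["d"])])
```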

## Using pd.Series methods

In [43]:
obj = pd.Series([1, 5, 2], index=['a', 'b', 'c'])

# Multiply by value
"Multiply by value:"
obj.mul(2)

# Multiply by series
"Multiply by series:"
obj.mul(obj)

# Sum, mean and median
f"Sum: {obj.sum()}"
f"Mean: {obj.mean()}"
f"Median: {obj.median()}"

# Finding maximum value
f"Max: {obj.max()}"
f"Max value is at: {obj.argmax()}"

'Multiply by value:'

a     2
b    10
c     4
dtype: int64

'Multiply by series:'

a     1
b    25
c     4
dtype: int64

'Sum: 8'

'Mean: 2.6666666666666665'

'Median: 2.0'

'Max: 5'

'Max value is at: 1'

## Using NumPy functions

NumPy functions and NumPy-like operations (filtering, scalar multiplication, applying math functions, etc.) can be used with pd.Series. The link between each index label and its value is preserved.

In [44]:
obj = pd.Series([1, 5, 2], index=['a', 'b', 'c'])

# NumPy-like operations examples
"Boolean filter:"
obj[obj >= 2]

"Element wise multiplication:"
obj * 2

# Numpy functions
"Power:"
np.power(obj, 3)

"Flip:"
np.flip(obj)

'Boolean filter:'

b    5
c    2
dtype: int64

'Element wise multiplication:'

a     2
b    10
c     4
dtype: int64

'Power:'

a      1
b    125
c      8
dtype: int64

'Flip:'

c    2
b    5
a    1
dtype: int64

<h1 class="center_text">Introduction to pd.DataFrame</h1>

## What is a pd.DataFrame?

A DataFrame represents a rectangular table of data.

It contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).

In [45]:
data = {
    "name": ["Max", "Sarah", "John"],
    "surname": ["Rockatansky", "Connor", "McClane"],
    "sex": ["M", "F", "M"],
    "age": [35, 25, 40],
    "country": ["AU", "US", "US"]
}

df = pd.DataFrame(data)
df

Unnamed: 0,name,surname,sex,age,country
0,Max,Rockatansky,M,35,AU
1,Sarah,Connor,F,25,US
2,John,McClane,M,40,US


## Constructors

In [46]:
"From dict:"
df = pd.DataFrame(
    {"name": ["Max", "Sarah", "John"],
     "surname": ["Rockatansky", "Connor", "McClane"],
     "sex": ["M", "F", "M"],
     "age": [35, 25, 40],
     "country": ["AU", "US", "US"]}
)
df

"From list of lists (or list of tuples):"
df = pd.DataFrame(
    [["Max", "Rockatansky", "M", 35],
     ["Sarah", "Connor", "F", 25],
     ["John", "McClane", "M", 40]]
)
df

'From dict:'

Unnamed: 0,name,surname,sex,age,country
0,Max,Rockatansky,M,35,AU
1,Sarah,Connor,F,25,US
2,John,McClane,M,40,US


'From list of lists (or list of tuples):'

Unnamed: 0,0,1,2,3
0,Max,Rockatansky,M,35
1,Sarah,Connor,F,25
2,John,McClane,M,40


## Custom indexes

By default, rows are numbered. You can also define a label for each row:

In [47]:
df = pd.DataFrame(
    {"name": ["Max", "Sarah", "John"],
     "surname": ["Rockatansky", "Connor", "McClane"],
     "sex": ["M", "F", "M"],
     "age": [35, 25, 40],
     "country": ["AU", "US", "US"]},
    index=["a", "b", "c"]
)
df

Unnamed: 0,name,surname,sex,age,country
a,Max,Rockatansky,M,35,AU
b,Sarah,Connor,F,25,US
c,John,McClane,M,40,US


<h1 class="center_text">Essential functionality</h1>

### *Note*

This section only covers functions used with pd.DataFrames. However, most of them can also be used with pd.Series.
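
As a quick sketch of that note, the same `reindex()`, `drop()` and `sort_values()` calls covered in this section also work on a series. The data is invented for illustration:

```python
import pandas as pd

s = pd.Series([35, 25, 40], index=["a", "b", "c"])

reordered = s.reindex(["c", "a", "b"])   # same values, new label order
without_b = s.drop("b")                  # removes the entry labelled "b"
by_value = s.sort_values()               # 25, 35, 40
```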

## Reindexing

Reindexing means creating a new object with the data conformed to a new index.

In [48]:
"Initial dataframe:"
df_1 = pd.DataFrame(
    {"name": ["Max", "Sarah", "John"],
     "surname": ["Rockatansky", "Connor", "McClane"],
     "sex": ["M", "F", "M"],
     "age": [35, 25, 40],
     "country": ["AU", "US", "US"]},
    index=["a", "b", "c"]
)
df_1

"Reindexing:"
df_2 = df_1.reindex(['a', 'c', 'b'])
df_2

"Reindexing and adding a new index:"
df_3 = df_1.reindex(['a', 'c', 'd', 'b'])
df_3

'Initial dataframe:'

Unnamed: 0,name,surname,sex,age,country
a,Max,Rockatansky,M,35,AU
b,Sarah,Connor,F,25,US
c,John,McClane,M,40,US


'Reindexing:'

Unnamed: 0,name,surname,sex,age,country
a,Max,Rockatansky,M,35,AU
c,John,McClane,M,40,US
b,Sarah,Connor,F,25,US


'Reindexing and adding a new index:'

Unnamed: 0,name,surname,sex,age,country
a,Max,Rockatansky,M,35.0,AU
c,John,McClane,M,40.0,US
d,,,,,
b,Sarah,Connor,F,25.0,US
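
In the last output the `age` column shows `35.0` instead of `35`: the new label `d` has no data, so it is filled with NaN and the column is upcast to float. A minimal sketch, using a reduced one-column frame as an assumption, of how the `fill_value` parameter of `reindex()` avoids this:

```python
import numpy as np
import pandas as pd

ages = pd.DataFrame({"age": [35, 25, 40]}, index=["a", "b", "c"])

# Default: the new label "d" gets NaN, so the column becomes float
with_nan = ages.reindex(["a", "c", "d", "b"])

# With fill_value, the new row gets 0 and the column stays integer
with_zero = ages.reindex(["a", "c", "d", "b"], fill_value=0)
```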


## Resetting index

In [49]:
"Initial dataframe:"
df_1 = pd.DataFrame(
    {"name": ["Max", "Sarah", "John"],
     "surname": ["Rockatansky", "Connor", "McClane"],
     "sex": ["M", "F", "M"],
     "age": [35, 25, 40]},
    index=["a", "b", "c"]
)
df_1

"Resetting without dropping:"
df_2 = df_1.reset_index(drop=False)
df_2

"Resetting with dropping:"
df_3 = df_1.reset_index(drop=True)
df_3

'Initial dataframe:'

Unnamed: 0,name,surname,sex,age
a,Max,Rockatansky,M,35
b,Sarah,Connor,F,25
c,John,McClane,M,40


'Resetting without dropping:'

Unnamed: 0,index,name,surname,sex,age
0,a,Max,Rockatansky,M,35
1,b,Sarah,Connor,F,25
2,c,John,McClane,M,40


'Resetting with dropping:'

Unnamed: 0,name,surname,sex,age
0,Max,Rockatansky,M,35
1,Sarah,Connor,F,25
2,John,McClane,M,40


## Dropping rows and columns

In [50]:
"Initial dataframe:"
df_1 = pd.DataFrame(
    {"name": ["Max", "Sarah", "John"],
     "surname": ["Rockatansky", "Connor", "McClane"],
     "sex": ["M", "F", "M"],
     "age": [35, 25, 40]},
    index=["a", "b", "c"]
)
df_1

"Dropping rows:"
df_2 = df_1.drop(["c", "b"])
df_2

"Dropping columns:"
df_3 = df_1.drop(["sex", "surname"], axis='columns')
df_3

'Initial dataframe:'

Unnamed: 0,name,surname,sex,age
a,Max,Rockatansky,M,35
b,Sarah,Connor,F,25
c,John,McClane,M,40


'Dropping rows:'

Unnamed: 0,name,surname,sex,age
a,Max,Rockatansky,M,35


'Dropping columns:'

Unnamed: 0,name,age
a,Max,35
b,Sarah,25
c,John,40


## Indexing

Indexing into a DataFrame with `[]` retrieves one or more columns.

In [51]:
"Initial dataframe:"
df = pd.DataFrame(
    np.arange(12).reshape(3, 4),
    columns=["one", "two", "three", "four"],
    index=["france", "italy", "slovakia"]
)
df

"One column:"
df["three"]

"Multiple columns:"
df[["three", "one"]]

'Initial dataframe:'

Unnamed: 0,one,two,three,four
france,0,1,2,3
italy,4,5,6,7
slovakia,8,9,10,11


'One column:'

france       2
italy        6
slovakia    10
Name: three, dtype: int64

'Multiple columns:'

Unnamed: 0,three,one
france,2,0
italy,6,4
slovakia,10,8


## Filtering: conditional indexing

In [52]:
"Initial dataframe:"
df = pd.DataFrame(
    np.arange(12).reshape(3, 4),
    columns=["one", "two", "three", "four"],
    index=["france", "italy", "slovakia"]
)
df

"One condition:"
df[df["two"] >= 5]

"Multiple conditions (or):"
df[(df["two"] >= 5) | (df["three"] >= 8)]

"Multiple conditions (and):"
df[(df["two"] >= 5) & (df["three"] >= 8)]

'Initial dataframe:'

Unnamed: 0,one,two,three,four
france,0,1,2,3
italy,4,5,6,7
slovakia,8,9,10,11


'One condition:'

Unnamed: 0,one,two,three,four
italy,4,5,6,7
slovakia,8,9,10,11


'Multiple conditions (or):'

Unnamed: 0,one,two,three,four
italy,4,5,6,7
slovakia,8,9,10,11


'Multiple conditions (and):'

Unnamed: 0,one,two,three,four
slovakia,8,9,10,11


## Selection with loc

`loc` is a special indexing operator to select a subset of rows and columns by label.

In [53]:
"Initial dataframe:"
df = pd.DataFrame(
    np.arange(12).reshape(3, 4),
    columns=["one", "two", "three", "four"],
    index=["france", "italy", "slovakia"]
)
df

"Selecting by row:"
df.loc["italy"]

f"Selecting by pair row-col: {df.loc['italy', 'three']}"

"Selecting multiple rows and cols:"
df.loc[["france", "italy"], ["one", "four"]]

'Initial dataframe:'

Unnamed: 0,one,two,three,four
france,0,1,2,3
italy,4,5,6,7
slovakia,8,9,10,11


'Selecting by row:'

one      4
two      5
three    6
four     7
Name: italy, dtype: int64

'Selecting by pair row-col: 6'

'Selecting multiple rows and cols:'

Unnamed: 0,one,four
france,0,3
italy,4,7


## Selecting with iloc

`iloc` is a special indexing operator to select a subset of rows and columns by position.

In [54]:
"Initial dataframe:"
df = pd.DataFrame(
    np.arange(12).reshape(3, 4),
    columns=["one", "two", "three", "four"],
    index=["france", "italy", "slovakia"]
)
df

"Selecting by row:"
df.iloc[1]

f"Selecting by pair row-col: {df.iloc[1, 2]}"

"Selecting multiple rows and cols:"
df.iloc[[0, 1], [0, 3]]

'Initial dataframe:'

Unnamed: 0,one,two,three,four
france,0,1,2,3
italy,4,5,6,7
slovakia,8,9,10,11


'Selecting by row:'

one      4
two      5
three    6
four     7
Name: italy, dtype: int64

'Selecting by pair row-col: 6'

'Selecting multiple rows and cols:'

Unnamed: 0,one,four
france,0,3
italy,4,7


## Selecting with loc */* iloc with slicing

Both `loc` and `iloc` work with slices.

In [55]:
"Initial dataframe:"
df = pd.DataFrame(
    np.arange(12).reshape(3, 4),
    columns=["one", "two", "three", "four"],
    index=["france", "italy", "slovakia"]
)
df

"loc with slicing:"
df.loc[:"italy", "two":"four"]

"iloc with slicing:"
df.iloc[:2, 1:4]

'Initial dataframe:'

Unnamed: 0,one,two,three,four
france,0,1,2,3
italy,4,5,6,7
slovakia,8,9,10,11


'loc with slicing:'

Unnamed: 0,two,three,four
france,1,2,3
italy,5,6,7


'iloc with slicing:'

Unnamed: 0,two,three,four
france,1,2,3
italy,5,6,7
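
One difference is easy to miss in the outputs above: a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its end position, as in plain Python. A small sketch on the same dataframe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape(3, 4),
    columns=["one", "two", "three", "four"],
    index=["france", "italy", "slovakia"],
)

by_label = df.loc["france":"italy"]   # label slice: "italy" is included
by_position = df.iloc[0:1]            # position slice: row 1 is excluded
```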


## Select single scalar with at

`at` is a special indexing operator to select a single scalar by row and column label.

In [56]:
"Initial dataframe:"
df = pd.DataFrame(
    np.arange(12).reshape(3, 4),
    columns=["one", "two", "three", "four"],
    index=["france", "italy", "slovakia"]
)
df

f"Selecting single scalar: {df.at['slovakia', 'two']}"

'Initial dataframe:'

Unnamed: 0,one,two,three,four
france,0,1,2,3
italy,4,5,6,7
slovakia,8,9,10,11


'Selecting single scalar: 9'

## Select single scalar with iat

`iat` is a special indexing operator to select a single scalar by row and column position.

In [57]:
"Initial dataframe:"
df = pd.DataFrame(
    np.arange(12).reshape(3, 4),
    columns=["one", "two", "three", "four"],
    index=["france", "italy", "slovakia"]
)
df

f"Selecting single scalar: {df.iat[2, 1]}"

'Initial dataframe:'

Unnamed: 0,one,two,three,four
france,0,1,2,3
italy,4,5,6,7
slovakia,8,9,10,11


'Selecting single scalar: 9'

## Arithmetic operators and methods with fill values (1/4)

pandas includes some arithmetic operations:

In [58]:
"Initial dataframes:"
df_1 = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    columns=["a", "b", "c"],
)
df_1

df_2 = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    columns=["a", "b", "c", "d"],
)
df_2

"Adding with operator:"
df_1 + df_2

"Adding with add() method:"
df_1.add(df_2)

'Initial dataframes:'

Unnamed: 0,a,b,c
0,0,1,2
1,3,4,5
2,6,7,8


Unnamed: 0,a,b,c,d
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15


'Adding with operator:'

Unnamed: 0,a,b,c,d
0,0.0,2.0,4.0,
1,7.0,9.0,11.0,
2,14.0,16.0,18.0,
3,,,,


'Adding with add() method:'

Unnamed: 0,a,b,c,d
0,0.0,2.0,4.0,
1,7.0,9.0,11.0,
2,14.0,16.0,18.0,
3,,,,


## Arithmetic operators and methods with fill values (2/4)

List of operations:

* `+` operator and `add()` method for addition
* `-` operator and `sub()` method for subtraction
* `*` operator and `mul()` method for multiplication
* `/` operator and `div()` method for division
* `//` operator and `floordiv()` method for floor division
* `**` operator and `pow()` method for exponentiation
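
Each operator and its method give the same result when no extra parameters are passed. A short sketch with subtraction and floor division on an invented two-column dataframe:

```python
import pandas as pd

df = pd.DataFrame({"a": [7, 8], "b": [9, 10]})

# Subtraction: operator and method
diff_op = df - 2
diff_method = df.sub(2)

# Floor division: operator and method
floor_op = df // 2
floor_method = df.floordiv(2)
```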

## Arithmetic operators and methods with fill values (3/4)

Some examples:

In [59]:
"Multiplying with operator:"
df_1 * df_2

"Multiplying with mul() method:"
df_1.mul(df_2)

'Multiplying with operator:'

Unnamed: 0,a,b,c,d
0,0.0,1.0,4.0,
1,12.0,20.0,30.0,
2,48.0,63.0,80.0,
3,,,,


'Multiplying with mul() method:'

Unnamed: 0,a,b,c,d
0,0.0,1.0,4.0,
1,12.0,20.0,30.0,
2,48.0,63.0,80.0,
3,,,,


## Arithmetic operators and methods with fill values (4/4)

Arithmetic methods have the `fill_value` parameter:

In [60]:
"Multiplying with mul() method:"
df_1.mul(df_2, fill_value=0)

"Exponentiation with pow() method:"
df_1.pow(df_2, fill_value=0)

'Multiplying with mul() method:'

Unnamed: 0,a,b,c,d
0,0.0,1.0,4.0,0.0
1,12.0,20.0,30.0,0.0
2,48.0,63.0,80.0,0.0
3,0.0,0.0,0.0,0.0


'Exponentiation with pow() method:'

Unnamed: 0,a,b,c,d
0,1.0,1.0,4.0,0.0
1,81.0,1024.0,15625.0,0.0
2,1679616.0,40353607.0,1073742000.0,0.0
3,0.0,0.0,0.0,0.0
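
`fill_value` only replaces a value that is missing on one side. When a position is missing in both operands, the result stays missing. A minimal sketch with two invented single-column dataframes:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"a": [1.0, np.nan]})
right = pd.DataFrame({"a": [10.0, np.nan, 30.0]})

total = left.add(right, fill_value=0)
# Row 0: 1 + 10 = 11
# Row 1: missing on both sides, so the result is NaN
# Row 2: missing only in `left`, so it is treated as 0 + 30 = 30
```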


## Function application

It is possible to apply a function on one-dimensional arrays to each column or row:

In [61]:
# Dummy function
def f(x):
    return x.max() - x.min()

"Initial dataframe:"
df = pd.DataFrame(
    np.arange(12).reshape(3, 4),
    columns=["one", "two", "three", "four"],
    index=["france", "italy", "slovakia"]
)
df

"Apply to each column:"
df.apply(f)

"Apply to each row:"
df.apply(f, axis='columns')

'Initial dataframe:'

Unnamed: 0,one,two,three,four
france,0,1,2,3
italy,4,5,6,7
slovakia,8,9,10,11


'Apply to each column:'

one      8
two      8
three    8
four     8
dtype: int64

'Apply to each row:'

france      3
italy       3
slovakia    3
dtype: int64

## Lambda functions

Functions to apply can be defined on the fly as lambda functions:

In [62]:
"Initial dataframe:"
df = pd.DataFrame(
    np.arange(12).reshape(3, 4),
    columns=["one", "two", "three", "four"],
    index=["france", "italy", "slovakia"]
)
df

"Apply to each column:"
df.apply(lambda x: x.max() - x.min())

"Apply to each row:"
df.apply(lambda x: x.max() - x.min(),
         axis='columns')

'Initial dataframe:'

Unnamed: 0,one,two,three,four
france,0,1,2,3
italy,4,5,6,7
slovakia,8,9,10,11


'Apply to each column:'

one      8
two      8
three    8
four     8
dtype: int64

'Apply to each row:'

france      3
italy       3
slovakia    3
dtype: int64

## Sorting by index

In [63]:
"Initial dataframe:"
df = pd.DataFrame(
    np.arange(12).reshape(3, 4),
    columns=["one", "two", "three", "four"],
    index=["france", "italy", "slovakia"]
)
df

"After sorting by row index:"
df = df.sort_index(ascending=False)
df

"After sorting by column index:"
df = df.sort_index(axis='columns')
df

'Initial dataframe:'

Unnamed: 0,one,two,three,four
france,0,1,2,3
italy,4,5,6,7
slovakia,8,9,10,11


'After sorting by row index:'

Unnamed: 0,one,two,three,four
slovakia,8,9,10,11
italy,4,5,6,7
france,0,1,2,3


'After sorting by column index:'

Unnamed: 0,four,one,three,two
slovakia,11,8,10,9
italy,7,4,6,5
france,3,0,2,1


## Sorting by value

In [64]:
"Initial dataframe:"
df = pd.DataFrame(
    {"name": ["Max", "Sarah", "John"],
     "surname": ["Rockatansky", "Connor", "McClane"],
     "sex": ["M", "F", "M"],
     "age": [35, 25, 40]}
)
df

"Sorting by one column:"
df = df.sort_values(by="sex")
df

"Sorting by multiple columns:"
df = df.sort_values(by=["sex", "name"])
df

'Initial dataframe:'

Unnamed: 0,name,surname,sex,age
0,Max,Rockatansky,M,35
1,Sarah,Connor,F,25
2,John,McClane,M,40


'Sorting by one column:'

Unnamed: 0,name,surname,sex,age
1,Sarah,Connor,F,25
0,Max,Rockatansky,M,35
2,John,McClane,M,40


'Sorting by multiple columns:'

Unnamed: 0,name,surname,sex,age
1,Sarah,Connor,F,25
2,John,McClane,M,40
0,Max,Rockatansky,M,35


## Ranking

Ranking assigns ranks from one through the number of valid data points in an array.

In [65]:
"Initial dataframe:"
df = pd.DataFrame(
    {"name": ["Max", "Sarah", "John"],
     "surname": ["Rockatansky", "Connor", "McClane"],
     "sex": ["M", "F", "M"],
     "age": [35, 25, 40]}
)
df

"Ranking:"
df.rank()

'Initial dataframe:'

Unnamed: 0,name,surname,sex,age
0,Max,Rockatansky,M,35
1,Sarah,Connor,F,25
2,John,McClane,M,40


'Ranking:'

Unnamed: 0,name,surname,sex,age
0,2.0,3.0,2.5,2.0
1,3.0,1.0,1.0,1.0
2,1.0,2.0,2.5,3.0
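
The `2.5` values in the `sex` column come from ties: by default, tied values share the average of the ranks they would occupy. The `method` parameter of `rank()` changes how ties are broken. A sketch on an invented numeric series with one tie:

```python
import pandas as pd

ages = pd.Series([35, 25, 35])

average = ages.rank()                 # default method="average": 2.5, 1.0, 2.5
lowest = ages.rank(method="min")      # ties share the lowest rank: 2.0, 1.0, 2.0
ordered = ages.rank(method="first")   # ties ranked by order of appearance: 2.0, 1.0, 3.0
```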


<h1 class="center_text">Summarizing and descriptive statistics</h1>

## Reduction methods (1/2)

pandas objects are equipped with a set of common mathematical and statistical methods for reduction (or summary statistics).

* `count()` - Number of non-NA values.
* `describe()` - Compute a set of statistics of each column.
* `min()`, `max()` - Compute minimum and maximum values.
* `argmin()`, `argmax()` - Compute integer positions of minimum and maximum values (Series only; `idxmin()` and `idxmax()` return the index labels and also work on DataFrames).
* `sum()` - Sum of values.
* `mean()` - Mean of values.
* `median()` - Median (50% quantile) of values.
* `prod()` - Product of values.
* `var()` - Sample variance.
* `std()` - Sample standard deviation.
* `quantile()` - Compute sample quantile ranging from 0 to 1.

## Reduction methods (2/2)

Example:

In [66]:
"Initial dataframe:"
df = pd.DataFrame(
    np.arange(12).reshape(3, 4),
    columns=["one", "two", "three", "four"],
    index=["france", "italy", "slovakia"]
)
df

"Describe:"
df.describe()

"Sum:"
df.sum()

'Initial dataframe:'

Unnamed: 0,one,two,three,four
france,0,1,2,3
italy,4,5,6,7
slovakia,8,9,10,11


'Describe:'

Unnamed: 0,one,two,three,four
count,3.0,3.0,3.0,3.0
mean,4.0,5.0,6.0,7.0
std,4.0,4.0,4.0,4.0
min,0.0,1.0,2.0,3.0
25%,2.0,3.0,4.0,5.0
50%,4.0,5.0,6.0,7.0
75%,6.0,7.0,8.0,9.0
max,8.0,9.0,10.0,11.0


'Sum:'

one      12
two      15
three    18
four     21
dtype: int64
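
A few more reductions on the same dataframe, as a sketch. By default each reduction works down the rows and returns one value per column; passing `axis="columns"` reduces across the columns and returns one value per row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape(3, 4),
    columns=["one", "two", "three", "four"],
    index=["france", "italy", "slovakia"],
)

row_means = df.mean(axis="columns")   # one value per row: 1.5, 5.5, 9.5
col_median = df.median()              # one value per column: 4, 5, 6, 7
col_q25 = df.quantile(0.25)           # 2, 3, 4, 5 (the 25% row of describe())
max_label = df.idxmax()               # row label of each column's maximum
```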

## Unique values

`unique()` returns the unique values in a column.

In [74]:
"Initial dataframe:"
df = pd.DataFrame(
    {"name": ["Max", "Sarah", "John"],
     "surname": ["Rockatansky", "Connor", "McClane"],
     "sex": ["M", "F", "M"],
     "age": [35, 25, 40],
     "country": ["AU", "US", "US"]}
)
df

"Unique name values"
df["name"].unique()

"Unique countries:"
df["country"].unique()

df.value_counts()

'Initial dataframe:'

Unnamed: 0,name,surname,sex,age,country
0,Max,Rockatansky,M,35,AU
1,Sarah,Connor,F,25,US
2,John,McClane,M,40,US


'Unique name values'

array(['Max', 'Sarah', 'John'], dtype=object)

'Unique countries:'

array(['AU', 'US'], dtype=object)

name   surname      sex  age  country
Sarah  Connor       F    25   US         1
Max    Rockatansky  M    35   AU         1
John   McClane      M    40   US         1
dtype: int64

## Value counts

`value_counts()` counts how many times each unique row appears in the dataframe.

In [79]:
"Initial dataframe"
df = pd.DataFrame(
    [[2008, "Hamilton", "McLaren"],
     [2009, "Button", "Brawn"],
     [2010, "Vettel", "Red Bull"],
     [2011, "Vettel", "Red Bull"],
     [2012, "Vettel", "Red Bull"],
     [2013, "Vettel", "Red Bull"],
     [2014, "Hamilton", "Mercedes"]],
    columns=["year", "driver", "constructor"]
)
df = df.set_index("year")
df


"Unique rows"
df.value_counts()

'Initial dataframe'

Unnamed: 0_level_0,driver,constructor
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2008,Hamilton,McLaren
2009,Button,Brawn
2010,Vettel,Red Bull
2011,Vettel,Red Bull
2012,Vettel,Red Bull
2013,Vettel,Red Bull
2014,Hamilton,Mercedes


'Unique rows'

driver    constructor
Vettel    Red Bull       4
Hamilton  Mercedes       1
          McLaren        1
Button    Brawn          1
dtype: int64
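
`value_counts()` is also available on a single column, where it counts each distinct value. A sketch reusing the driver column from the example above; `normalize=True` returns proportions instead of counts:

```python
import pandas as pd

drivers = pd.Series(
    ["Hamilton", "Button", "Vettel", "Vettel", "Vettel", "Vettel", "Hamilton"]
)

titles = drivers.value_counts()                # Vettel 4, Hamilton 2, Button 1
shares = drivers.value_counts(normalize=True)  # fraction of rows per driver
```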

<h1 class="center_text">Loading and storage</h1>

## Parsing functions

pandas includes several functions for reading tabular data (from a file or URL) as a dataframe:

* `read_csv()`
* `read_table()`
* `read_excel()`
* `read_parquet()`
* `read_json()`
* `read_sql()`

## Read CSV file

`read_csv()` reads a CSV file into a DataFrame.

This function has many parameters. You will find a full description in this [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html?highlight=read_csv).

In [99]:
"Read CSV, using default index"
df = pd.read_csv("samples/ex1.csv")
df

"Specifying header and index"
df = pd.read_csv("samples/ex1.csv",
                 header=0,
                 index_col="country")
df

"Remove rows"
df = pd.read_csv("samples/ex1.csv",
                 skiprows=[0, 2])
df

'Read CSV, using default index'

Unnamed: 0,country,one,two,three,four
0,france,0,1,2,3
1,italy,4,5,6,7
2,slovakia,8,9,10,11


'Specifying header and index'

Unnamed: 0_level_0,one,two,three,four
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
france,0,1,2,3
italy,4,5,6,7
slovakia,8,9,10,11


'Remove rows'

Unnamed: 0,france,0,1,2,3
0,slovakia,8,9,10,11
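
The same calls can be tried without a file on disk by reading from an in-memory buffer. In this sketch the CSV text is an assumption, reconstructed from the outputs above rather than copied from `samples/ex1.csv`. It also shows why the last output lost its header: `skiprows` counts lines of the file starting at 0, so line 0 is the header itself.

```python
import io
import pandas as pd

# Assumed content of the sample file, reconstructed from the outputs above
csv_text = """country,one,two,three,four
france,0,1,2,3
italy,4,5,6,7
slovakia,8,9,10,11
"""

# Use the "country" column as the row index
df = pd.read_csv(io.StringIO(csv_text), index_col="country")

# Skipping lines 0 and 2 drops the header and "italy",
# so the "france" line is read as the header
no_header = pd.read_csv(io.StringIO(csv_text), skiprows=[0, 2])

# Skipping only line 2 keeps the header and drops "italy"
skipped = pd.read_csv(io.StringIO(csv_text), skiprows=[2])
```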


## Writing data

pandas also includes several functions to export tabular data to a file:

* `to_csv()`
* `to_excel()`
* `to_parquet()`
* `to_json()`
* `to_sql()`
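
A minimal sketch of a write-and-read round trip with `to_csv()`. It writes to an in-memory buffer instead of a file so that it runs anywhere; the dataframe is invented for illustration:

```python
import io
import pandas as pd

df = pd.DataFrame({"one": [0, 4], "two": [1, 5]}, index=["france", "italy"])
df.index.name = "country"

# Write to a text buffer (a file path would work the same way)
buffer = io.StringIO()
df.to_csv(buffer)
csv_text = buffer.getvalue()

# Read it back, restoring "country" as the index
restored = pd.read_csv(io.StringIO(csv_text), index_col="country")
```

The index is written as the first column by default; pass `index=False` to `to_csv()` to leave it out.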

