# Working with [Pandas](https://pandas.pydata.org)

#### Python Data Analysis Library, or pandas

- Works with structured data such as plain-text tables (CSV), Excel spreadsheets (xlsx), SQL tables, R data, or even plain Python structures (lists, dicts)
- Handles heterogeneous data types as well as time series data
- Cleans up data and prepares it for analysis
- Analyses data and passes it on to other systems (such as Scikit-Learn, TensorFlow, etc.)
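
The points above can be sketched as one small workflow: load, clean, analyse, hand over. This is only an illustrative sketch. The CSV content, the column names (`city`, `temperature_c`, `measured_on`) and the variable names are made up for the example and do not come from any real dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content, inlined so the example runs without a file on disk.
csv_text = """city,temperature_c,measured_on
Oslo,3.5,2021-01-01
Oslo,,2021-01-02
Rome,12.0,2021-01-01
Rome,13.5,2021-01-02
"""

# Load structured, heterogeneous data (text, floats, dates) into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["measured_on"])

# Clean up: drop the row whose temperature is missing.
clean = df.dropna(subset=["temperature_c"])

# Analyse: mean temperature per city.
mean_by_city = clean.groupby("city")["temperature_c"].mean()

# Hand over: a plain NumPy array that other libraries can consume.
features = clean[["temperature_c"]].to_numpy()
```
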

In [2]:
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
%matplotlib inline
plt.rcParams['figure.figsize'] = [16,8]

import pandas as pd

### Series
Like a NumPy array, but every element carries an index label, much like a keyed row in an SQL table.

In [12]:
s1 = pd.Series([1.1,1.2,1.3,1.4])
s1

0    1.1
1    1.2
2    1.3
3    1.4
dtype: float64

In [13]:
s1[1:3]

1    1.2
2    1.3
dtype: float64

In [14]:
s2 = pd.Series([1,2,3,4], index=['one', 'two','three','four'])
s2

one      1
two      2
three    3
four     4
dtype: int64

In [15]:
s2['three']

3

In [16]:
s2['one':'four':2]

one      1
three    3
dtype: int64
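
The two slices shown so far behave differently at the end point: the positional slice `s1[1:3]` stops before position 3, while the label slice on `s2` includes its last label. A minimal sketch that makes the distinction explicit with `.loc` (labels) and `.iloc` (positions), reusing the same `s2` values:

```python
import pandas as pd

s2 = pd.Series([1, 2, 3, 4], index=["one", "two", "three", "four"])

# Label-based slicing with .loc includes both end labels.
by_label = s2.loc["two":"three"]

# Position-based slicing with .iloc excludes the stop position, like a list.
by_position = s2.iloc[1:3]

# Both expressions select the rows labelled 'two' and 'three'.
same_rows = by_label.equals(by_position)
```

Spelling out `.loc` or `.iloc` avoids ambiguity when the index labels are themselves integers.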

In [18]:
s1.index

RangeIndex(start=0, stop=4, step=1)

In [19]:
s2.index

Index(['one', 'two', 'three', 'four'], dtype='object')

In [20]:
s1.values

array([1.1, 1.2, 1.3, 1.4])

In [21]:
s2.values

array([1, 2, 3, 4])

In [31]:
populations = pd.Series( # not official population values
    {
        "London": 873438,
        "Barcelona": 586872,
        "Milan": 84375,
        "Paris": 384732,
        "Helsinki" : 87342
    }
)
populations

London       873438
Barcelona    586872
Milan         84375
Paris        384732
Helsinki      87342
dtype: int64

In [32]:
for k in populations.keys():
    print(k)

London
Barcelona
Milan
Paris
Helsinki


In [33]:
populations / 100000 # numbers in units of 100,000 (not millions)

London       8.73438
Barcelona    5.86872
Milan        0.84375
Paris        3.84732
Helsinki     0.87342
dtype: float64

In [35]:
(populations / 1000000).std()

0.3378667595461264

In [36]:
populations.idxmax()

'London'

**A pandas Series can also be filtered with boolean conditions**

cf. `SELECT * FROM populations WHERE value > 100000;` in [SQL](https://www.postgresql.org)

In [38]:
populations[populations > 100000]

London       873438
Barcelona    586872
Paris        384732
dtype: int64
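
Several conditions can be combined, comparable to `AND`, `OR` and `NOT` in an SQL `WHERE` clause. The sketch below reuses the same illustrative (not official) values. The thresholds and the variable names `mask`, `mid_sized` and `others` are chosen only for the example.

```python
import pandas as pd

# Same illustrative (not official) population values as above.
populations = pd.Series(
    {
        "London": 873438,
        "Barcelona": 586872,
        "Milan": 84375,
        "Paris": 384732,
        "Helsinki": 87342,
    }
)

# Each comparison yields a boolean Series aligned on the index.
# Use & (and), | (or), ~ (not), with parentheses around each comparison.
mask = (populations > 100000) & (populations < 600000)

# Indexing with the mask keeps only the rows where it is True.
mid_sized = populations[mask]

# Negating the mask selects the complement.
others = populations[~mask]
```

The parentheses are needed because `&` and `|` bind more tightly than the comparison operators in Python.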

### DataFrames (very important)
![dataframe_image](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fvrzkj25a871bpq7t1ugcgmn9-wpengine.netdna-ssl.com%2Fwp-content%2Fuploads%2F2019%2F01%2Fpandas-dataframe-has-indexes.png&f=1&nofb=1)
*Fig.: Basic structure of a DataFrame*
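
As a first, minimal sketch of that structure, a DataFrame can be built from the `populations` Series used earlier: the Series index becomes the row index and each dictionary key becomes a column. The column names `population` and `population_millions` are chosen here for illustration only.

```python
import pandas as pd

# Same illustrative (not official) population values as above.
populations = pd.Series(
    {
        "London": 873438,
        "Barcelona": 586872,
        "Milan": 84375,
        "Paris": 384732,
        "Helsinki": 87342,
    }
)

# A DataFrame is a collection of columns that share one row index.
df = pd.DataFrame({"population": populations})

# New columns can be derived from existing ones.
df["population_millions"] = df["population"] / 1_000_000

row_labels = df.index.tolist()       # the city names
column_labels = df.columns.tolist()  # the two column names

# A single cell is addressed by row label and column label.
paris = df.loc["Paris", "population"]
```
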