# Day 5: Pandas

Probably the most important tool for a data scientist in Python is pandas. Pandas is built on top of numpy, so you will find many similarities and mechanics you already know. Where numpy makes heavy use of the `ndarray`, most of the magic in pandas happens in the `DataFrame`. Like data frames in R, the pandas `DataFrame` stores data in a rectangular grid that is easy to overview. Numpy is mostly used for numerical data, while pandas can be used for any tabular data. Pandas also has many (really, a lot of) built-in functions.


Documentation can be found here:
https://pandas.pydata.org/pandas-docs/stable/index.html

In [2]:
import numpy as np
import pandas as pd # common way to import pandas

We will start with the `Series`, which forms one column of a `DataFrame`. A `Series` can also be seen as one feature of a dataset. A `Series` can easily be created from a list and is similar to a 1-dimensional numpy array.

## Series

In [3]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
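The `float64` dtype above comes from the `np.nan` entry: NaN is a floating-point value, so pandas stores the whole `Series` as floats. A minimal sketch of this behaviour (the variable names `ints` and `with_nan` are only illustrative):

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 3, 5])            # only integers
with_nan = pd.Series([1, 3, np.nan])   # one missing value

print(ints.dtype)       # an integer dtype
print(with_nan.dtype)   # float64, because NaN is a float

# isna() marks the missing entries, count() ignores them
print(with_nan.isna().sum(), with_nan.count())
```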

A `Series` can also contain strings or any other type of value.

In [4]:
t = pd.Series(["red", "green", "blue", "yellow", "purple", "black"])
t

0       red
1     green
2      blue
3    yellow
4    purple
5     black
dtype: object

When we print out a `Series`, we see that we get two columns of values. The right one holds the values we specified, and the left one is the index. By default, the index is just the integer position, but we can also supply our own index.

In [5]:
u = pd.Series(np.arange(5), index=list("ABCDE")) #create a series from numpy array and custom index
u.index

Index(['A', 'B', 'C', 'D', 'E'], dtype='object')

We can access the elements through their index labels, including the newly specified ones:

In [6]:
s[2], t[4], u["B"]

(5.0, 'purple', 1)
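With the default integer index, labels and positions happen to coincide, which can hide the difference between the two. As a small sketch (not part of the original cells), `.loc` always looks up labels and `.iloc` always looks up positions:

```python
import numpy as np
import pandas as pd

u = pd.Series(np.arange(5), index=list("ABCDE"))

print(u.loc["B"])    # label-based: the element labelled "B"
print(u.iloc[1])     # position-based: the second element

# label slices include the end label, position slices exclude the end
print(u.loc["B":"D"].tolist())
print(u.iloc[1:3].tolist())
```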

Besides creating your own index, pandas also offers many ways to create an `Index`.
Some can be found here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.html
Some examples:
- `DatetimeIndex` for dates
- `TimedeltaIndex` for time steps
- `CategoricalIndex` for defined categories

In [7]:
dates = pd.date_range('20200101', periods=6) # index for 6 days starting with 2020-01-01
dates

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06'],
              dtype='datetime64[ns]', freq='D')

In [8]:
times = pd.timedelta_range(start=0, periods=6, freq="3s")
times

TimedeltaIndex(['00:00:00', '00:00:03', '00:00:06', '00:00:09', '00:00:12',
                '00:00:15'],
               dtype='timedelta64[ns]', freq='3S')

In [9]:
times_6H = pd.timedelta_range(start=0, periods=6, freq="6H")
times_6H

TimedeltaIndex(['0 days 00:00:00', '0 days 06:00:00', '0 days 12:00:00',
                '0 days 18:00:00', '1 days 00:00:00', '1 days 06:00:00'],
               dtype='timedelta64[ns]', freq='6H')

In [10]:
c = pd.CategoricalIndex(['a', 'b', 'c', 'a', 'b', 'c'])
c

CategoricalIndex(['a', 'b', 'c', 'a', 'b', 'c'], categories=['a', 'b', 'c'], ordered=False, dtype='category')

In [11]:
c_ord = pd.CategoricalIndex(['a', 'b', 'c', 'a', 'b', 'c'], ordered=True)
c_ord 


CategoricalIndex(['a', 'b', 'c', 'a', 'b', 'c'], categories=['a', 'b', 'c'], ordered=True, dtype='category')

In [12]:
c_ord.min(), c_ord.max() # if ordered, can have min max values

('a', 'c')

In [13]:
v = pd.Series(np.arange(6), index=c_ord)
v["a"]

a    0
a    3
dtype: int64

To be honest, mostly the normal `RangeIndex` (default integer index) is used, and values such as time can be stored as a feature in another `Series` itself. But it is useful to know that we can use different indexes.
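To make the two options concrete, here is a minimal sketch that stores time as an ordinary column next to the default `RangeIndex` and then moves it into the index and back. The frame `measurements` and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# time stored as an ordinary column, next to a default RangeIndex
measurements = pd.DataFrame({
    "time": pd.date_range("20200101", periods=4),
    "value": np.arange(4),
})
print(type(measurements.index).__name__)

# promote the column to the index ...
by_time = measurements.set_index("time")
print(type(by_time.index).__name__)

# ... and turn it back into a column
restored = by_time.reset_index()
print(list(restored.columns))
```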

## DataFrames

Next is the key element, the `DataFrame`, which is similar to a 2-dimensional numpy array and stores data in a grid. There are multiple ways to create a `DataFrame`.

In [14]:
df = pd.DataFrame() # empty dataframe
df.dtypes

Series([], dtype: object)

As we have seen, a `DataFrame` consists of one or more `Series`. We can create a `DataFrame` by joining several `Series` together.

In [15]:
s1 = pd.Series(np.random.rand(5))
s2 = pd.Series(np.random.rand(5))
print(s1) 
print(s2)

0    0.759753
1    0.307819
2    0.133587
3    0.314277
4    0.244844
dtype: float64
0    0.364165
1    0.363603
2    0.468201
3    0.149263
4    0.836428
dtype: float64


In [16]:
df = pd.concat([s1,s2])
print(df) 
print(type(df)) # actually still a series

0    0.759753
1    0.307819
2    0.133587
3    0.314277
4    0.244844
0    0.364165
1    0.363603
2    0.468201
3    0.149263
4    0.836428
dtype: float64
<class 'pandas.core.series.Series'>


In [17]:
df = pd.concat([s1,s2], axis=1)
df

| | 0 | 1 |
|---|---|---|
| 0 | 0.759753 | 0.364165 |
| 1 | 0.307819 | 0.363603 |
| 2 | 0.133587 | 0.468201 |
| 3 | 0.314277 | 0.149263 |
| 4 | 0.244844 | 0.836428 |
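Two details of `pd.concat` along `axis=1` are worth a quick sketch: the `keys` argument can name the resulting columns, and rows are matched by index label rather than by position. The names `named` and `shorter` are only illustrative.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(np.random.rand(5))
s2 = pd.Series(np.random.rand(5))

# keys= gives the resulting columns names instead of 0 and 1
named = pd.concat([s1, s2], axis=1, keys=["first", "second"])
print(named.shape)
print(list(named.columns))

# rows are matched by index label, so a shorter Series leaves NaN behind
shorter = pd.concat([s1, s2.iloc[:3]], axis=1)
print(shorter[1].isna().sum())
```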


In [18]:
df = pd.DataFrame(np.random.randn(6, 4)) # from numpy array
df

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | -0.848523 | 0.08698 | 0.472497 | -0.270254 |
| 1 | -0.306851 | -1.716282 | 1.582826 | 0.862437 |
| 2 | -0.540366 | 1.129377 | -1.219719 | 1.087776 |
| 3 | 0.603924 | 1.844844 | 0.722826 | 1.393889 |
| 4 | -0.548497 | -0.792563 | 1.02283 | 0.325947 |
| 5 | 0.937784 | 0.528206 | 0.008429 | 1.049149 |


When we print out a `DataFrame`, we see that now we have two indices, one for the rows and one for the columns. As with `Series`, we can specify those in the creation.

In [19]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD')) #row index as dates, and columns as category 
df

| | A | B | C | D |
|---|---|---|---|---|
| 2020-01-01 | -0.254748 | 1.086223 | 0.583008 | -0.814746 |
| 2020-01-02 | 0.910987 | -2.176922 | -0.531515 | 0.546465 |
| 2020-01-03 | 1.213148 | -0.952355 | 1.145491 | -0.208332 |
| 2020-01-04 | -3.640123 | 0.173859 | 0.967653 | 0.225832 |
| 2020-01-05 | -0.972563 | 1.465113 | 0.329038 | -0.507309 |
| 2020-01-06 | 0.810312 | 1.79457 | 2.126397 | 3.310585 |


Unlike a numpy array, a `DataFrame` can hold multiple types: each column has its own dtype.

In [20]:
df.dtypes

A    float64
B    float64
C    float64
D    float64
dtype: object

We can also create `DataFrames` from python dictionaries.

In [21]:
df2 = pd.DataFrame({'A': 1,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
df2

| | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 1 | 2013-01-02 | 1.0 | 3 | test | foo |
| 1 | 1 | 2013-01-02 | 1.0 | 3 | train | foo |
| 2 | 1 | 2013-01-02 | 1.0 | 3 | test | foo |
| 3 | 1 | 2013-01-02 | 1.0 | 3 | train | foo |


In [22]:
df2.dtypes # each column or Series has a different type

A             int64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

We also see that broadcasting is applied if a value is a scalar rather than a list. Otherwise, the lists (or arrays) for all columns must have the same length.

In [23]:
df2 = pd.DataFrame({'A': 1,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 2, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
# this raises an error, because the arrays do not all have the same length

ValueError: arrays must all be same length
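The broadcasting rule can be sketched on a smaller dictionary. The frames `ok` and `only_scalars` are illustrative, and the exact wording of the error message depends on the pandas version, so the sketch only checks that a `ValueError` is raised.

```python
import numpy as np
import pandas as pd

ok = pd.DataFrame({
    "A": 1,                    # scalar, repeated for every row
    "B": np.array([3] * 4),    # length 4 fixes the number of rows
    "F": "foo",                # scalar, repeated as well
})
print(ok.shape)

# if every value is a scalar, pandas cannot infer the number of rows,
# so an explicit index is needed
only_scalars = pd.DataFrame({"A": 1, "F": "foo"}, index=range(3))
print(only_scalars.shape)

# two lists of different length cannot be combined
mismatch_raised = False
try:
    pd.DataFrame({"B": [3] * 4, "D": [3] * 2})
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)
```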

Once we have built a `DataFrame`, we can also access its columns as attributes. (This is a pandas feature; IPython additionally offers tab completion for the column names.)

In [24]:
df2.A

0    1
1    1
2    1
3    1
Name: A, dtype: int64

In [25]:
df2.E

0     test
1    train
2     test
3    train
Name: E, dtype: category
Categories (2, object): [test, train]

### DataFrames from files

Since our data is usually stored in some file, pandas allows us to read many file types directly into a pandas `DataFrame`. Very convenient! We also see that pandas uses the header row directly as the column index.

In [39]:
student_performance_df = pd.read_csv('NewStudentPerformance.csv')
#student_performance_df

Just one line of code! wuhu!!

![](1_line_code.jpeg)
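To try `read_csv` without a file on disk, the following sketch feeds it an in-memory text buffer. The CSV content and its column names are invented for illustration and have nothing to do with `NewStudentPerformance.csv`.

```python
import io
import pandas as pd

# stand-in for a file on disk; the columns are made up
csv_text = "name,score,passed\nAda,91.5,True\nLin,78.0,True\nSam,49.5,False\n"

frame = pd.read_csv(io.StringIO(csv_text))
print(list(frame.columns))   # the header row becomes the column index
print(frame.dtypes)          # dtypes are inferred per column
```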

Pandas can also read and write .xlsx (MS Excel) or .h5 (from the HDF group: https://www.hdfgroup.org/) files.
You might need some extra software installed for that (typically `openpyxl` for .xlsx and PyTables for .h5)!

In [37]:
df_excel = pd.read_excel('excel_example.xlsx', sheet_name='Sheet1', index_col=None, na_values=['NA'])
df_excel

In [40]:
# Store the data from the former csv file in an Excel file
student_performance_df.to_excel('NewStudentPerformance.xlsx', sheet_name="Sheet1")

In [38]:
df_hdf = pd.read_hdf("hdf_example.h5",'df')
df_hdf

### DataFrame Basic Functions

Some basic functions for viewing the data.

In [None]:
df.head(2) # see the first two rows in df

In [None]:
df.tail(2) # see last two rows in df 

In [None]:
df.columns, df.index

We can display some quick statistics with the function `describe()`.

In [None]:
df.describe()

For a `DataFrame` with mixed dtypes, `describe()` by default skips non-numerical columns such as strings and categories.

In [None]:
df2.describe()
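If a summary of the non-numerical columns is wanted as well, `describe()` accepts an `include` argument. A minimal sketch on a made-up frame `mixed`:

```python
import pandas as pd

mixed = pd.DataFrame({
    "value": [1.0, 2.0, 3.0, 4.0],
    "label": ["test", "train", "test", "train"],
})

default_summary = mixed.describe()            # only the numeric column "value"
full_summary = mixed.describe(include="all")  # adds unique/top/freq for "label"

print(default_summary)
print(full_summary)
```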

We can also rearrange and sort our DataFrames quickly.

In [42]:
df.T # Transposing, just like in numpy

| | 2020-01-01 00:00:00 | 2020-01-02 00:00:00 | 2020-01-03 00:00:00 | 2020-01-04 00:00:00 | 2020-01-05 00:00:00 | 2020-01-06 00:00:00 |
|---|---|---|---|---|---|---|
| A | -0.254748 | 0.910987 | 1.213148 | -3.640123 | -0.972563 | 0.810312 |
| B | 1.086223 | -2.176922 | -0.952355 | 0.173859 | 1.465113 | 1.79457 |
| C | 0.583008 | -0.531515 | 1.145491 | 0.967653 | 0.329038 | 2.126397 |
| D | -0.814746 | 0.546465 | -0.208332 | 0.225832 | -0.507309 | 3.310585 |


In [44]:
df.sort_index(axis=1, ascending=False) # sort by the labels along an axis (here: column labels, descending)

| | D | C | B | A |
|---|---|---|---|---|
| 2020-01-01 | -0.814746 | 0.583008 | 1.086223 | -0.254748 |
| 2020-01-02 | 0.546465 | -0.531515 | -2.176922 | 0.910987 |
| 2020-01-03 | -0.208332 | 1.145491 | -0.952355 | 1.213148 |
| 2020-01-04 | 0.225832 | 0.967653 | 0.173859 | -3.640123 |
| 2020-01-05 | -0.507309 | 0.329038 | 1.465113 | -0.972563 |
| 2020-01-06 | 3.310585 | 2.126397 | 1.79457 | 0.810312 |


In [45]:
df.sort_values(by='B') #sort by values in a column

| | A | B | C | D |
|---|---|---|---|---|
| 2020-01-02 | 0.910987 | -2.176922 | -0.531515 | 0.546465 |
| 2020-01-03 | 1.213148 | -0.952355 | 1.145491 | -0.208332 |
| 2020-01-04 | -3.640123 | 0.173859 | 0.967653 | 0.225832 |
| 2020-01-01 | -0.254748 | 1.086223 | 0.583008 | -0.814746 |
| 2020-01-05 | -0.972563 | 1.465113 | 0.329038 | -0.507309 |
| 2020-01-06 | 0.810312 | 1.79457 | 2.126397 | 3.310585 |
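`sort_values` also accepts a sort direction and a list of columns. The following sketch uses a small made-up frame so that the resulting order is easy to check by eye; note that sorting returns a new `DataFrame` and leaves the original untouched.

```python
import pandas as pd

frame = pd.DataFrame({"group": ["b", "a", "b", "a"],
                      "value": [2, 4, 1, 3]})

# largest value first
desc = frame.sort_values(by="value", ascending=False)
print(desc["value"].tolist())    # [4, 3, 2, 1]

# sort by group, then by value inside each group
nested = frame.sort_values(by=["group", "value"])
print(nested.index.tolist())     # [3, 1, 2, 0]

# the original order is unchanged
print(frame["value"].tolist())   # [2, 4, 1, 3]
```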


### Selection

We can use the familiar indexing methods from Python and numpy, but pandas also offers optimized functions to select data.

In [47]:
df['A'] 

2020-01-01   -0.254748
2020-01-02    0.910987
2020-01-03    1.213148
2020-01-04   -3.640123
2020-01-05   -0.972563
2020-01-06    0.810312
Freq: D, Name: A, dtype: float64

In [46]:
df.A

2020-01-01   -0.254748
2020-01-02    0.910987
2020-01-03    1.213148
2020-01-04   -3.640123
2020-01-05   -0.972563
2020-01-06    0.810312
Freq: D, Name: A, dtype: float64

In [48]:
df[0:3] # slice rows

| | A | B | C | D |
|---|---|---|---|---|
| 2020-01-01 | -0.254748 | 1.086223 | 0.583008 | -0.814746 |
| 2020-01-02 | 0.910987 | -2.176922 | -0.531515 | 0.546465 |
| 2020-01-03 | 1.213148 | -0.952355 | 1.145491 | -0.208332 |


In [51]:
df['20200101':'20200104'] # slice rows by index label (both endpoints are included)

| | A | B | C | D |
|---|---|---|---|---|
| 2020-01-01 | -0.254748 | 1.086223 | 0.583008 | -0.814746 |
| 2020-01-02 | 0.910987 | -2.176922 | -0.531515 | 0.546465 |
| 2020-01-03 | 1.213148 | -0.952355 | 1.145491 | -0.208332 |
| 2020-01-04 | -3.640123 | 0.173859 | 0.967653 | 0.225832 |
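The plain bracket syntax above mixes column selection and row slicing. As a hedged sketch of the more explicit pandas selectors, `.loc` selects rows and columns by label and `.iloc` by integer position. The frame below uses `np.arange` instead of random numbers so the result is reproducible; it is an assumption for illustration, not the `df` from the cells above.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20200101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# .loc selects by label: rows first, then columns (end labels included)
by_label = df.loc["20200102":"20200103", ["A", "C"]]
print(by_label)

# .iloc selects by integer position (end position excluded)
by_position = df.iloc[1:3, [0, 2]]
print(by_label.equals(by_position))
```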
