# Quantrack Python crash course 7.

*Pandas* is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

pandas is well suited for many different kinds of data:

* Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet.

* Ordered and unordered (not necessarily fixed-frequency) time series data.

* Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels.

* Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure.

*NumPy* is a general-purpose array-processing package. It provides a high-performance multidimensional array object, and tools for working with these arrays.

It is the fundamental package for scientific computing with Python.

These two packages are at the core of most data-analysis projects in Python.
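
As a quick illustration of the array object described above, here is a minimal sketch. The array `a` and its values are made up for the example.

```python
import numpy as np

# A 2-D array: every element shares one dtype.
a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.shape)        # (2, 3)
print(a.ndim)         # 2

# Arithmetic is applied element by element, without explicit loops.
print((a * 2).sum())  # 42

# Reductions can run along one axis: axis=0 collapses the rows.
print(a.sum(axis=0))  # [5 7 9]
```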

### 1. Getting started - Object creation
### 2. Basic commands - Part 1.
### 3. Basic commands - Part 2.

Pandas DataFrames are composed of Series. A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic way to create a Series is to call `pd.Series`, as the examples in section 1 show.

## 1. Getting Started

Questions:

* Try to understand what the `np.random` commands used below do.
* What is the difference between a DataFrame and a Series?

We can create a Series by passing a list of values, letting pandas create a default integer index. In the output, NaN means *missing numerical data*.

#### a. Pandas Series

In [21]:
#Imports used by every cell in this notebook
import numpy as np
import pandas as pd
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
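
To illustrate how NaN behaves in practice, the sketch below rebuilds the same Series and shows that pandas can flag the missing entry, and that reductions such as `mean()` skip it by default. The variable names `missing` and `clean` are only for this example.

```python
import numpy as np
import pandas as pd

# Same values as in the cell above; np.nan marks the missing entry.
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# isna() returns a boolean Series that is True where data is missing.
missing = s.isna()
print(missing.sum())   # number of missing values: 1

# Reductions skip NaN by default (skipna=True).
print(s.mean())        # (1 + 3 + 5 + 6 + 8) / 5 = 4.6

# dropna() returns a new Series without the missing entries.
clean = s.dropna()
print(len(clean))      # 5
```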

In [22]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a    0.935063
b   -1.667109
c    0.080092
d   -1.338738
e   -0.391146
dtype: float64
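
The values above change on every run because `np.random.randn(5)` draws five samples from the standard normal distribution. The sketch below shows one way to make such draws repeatable by fixing a seed. The seed value 0 is arbitrary and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# np.random.randn(n) draws n samples from the standard normal distribution
# (mean 0, standard deviation 1), so the values differ between runs.
np.random.seed(0)              # fix the seed to make the draw repeatable
first = np.random.randn(5)

np.random.seed(0)              # same seed, same numbers
second = np.random.randn(5)
print(np.array_equal(first, second))   # True

# The newer Generator API is an alternative way to get seeded draws.
rng = np.random.default_rng(0)
s = pd.Series(rng.standard_normal(5), index=['a', 'b', 'c', 'd', 'e'])
print(s.shape)                 # (5,)
```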

In [11]:
#print the index
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [15]:
#Series can be instantiated from dicts:
d = {'b': 1, 'a': 0, 'c': 2}
s = pd.Series(d)
s

b    1
a    0
c    2
dtype: int64

dtype means data type; here the Series holds integers.

In the example above, the Series keeps the insertion order of the dict keys. On a Python version lower than 3.6 or a pandas version lower than 0.23, the Series would instead be ordered by the lexical order of the dict keys (i.e. ['a', 'b', 'c'] rather than ['b', 'a', 'c']).

If an index is passed, the values in data corresponding to the labels in the index will be pulled out.
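
A minimal sketch of this behavior, reusing the dict from the cell above. The label `'d'` is deliberately absent from the dict to show what happens to labels without a matching key.

```python
import pandas as pd

d = {'b': 1, 'a': 0, 'c': 2}

# Only the labels listed in `index` are used, in that order.
# 'd' is not a key of the dict, so its value is NaN (missing).
s = pd.Series(d, index=['b', 'c', 'd', 'a'])
print(s)

# Because of the NaN, the dtype is promoted from int64 to float64.
print(s.dtype)            # float64
print(s.isna().sum())     # 1
```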

If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.
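
For instance, under the assumption of a five-label index chosen for this example:

```python
import pandas as pd

# The scalar 5.0 is repeated for every label in the index.
s = pd.Series(5.0, index=['a', 'b', 'c', 'd', 'e'])
print(s)
print(len(s))   # 5
```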

#### b. Pandas DataFrames

Let's create a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

1. We first create the date index
2. We create the DataFrame

In [27]:
#We create the index
dates = pd.date_range('20130101', periods=6)
print(dates)

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
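
`pd.date_range` defaults to a daily frequency, as the `freq='D'` in the output shows. As a hedged sketch of two other common ways to call it, the example below uses the business-day frequency `'B'` and an explicit `start`/`end` pair. The variable names are only for illustration.

```python
import pandas as pd

# freq='B' keeps business days (Monday to Friday) only.
# 2013-01-01 was a Tuesday, so the weekend of 5-6 January is skipped.
business_days = pd.date_range('20130101', periods=6, freq='B')
print(business_days[-1])    # 2013-01-08 00:00:00

# Passing start and end instead of periods gives every day in the range.
span = pd.date_range(start='2013-01-01', end='2013-01-06')
print(len(span))            # 6
```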


In [28]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df

                   A         B         C         D
2013-01-01 -1.262555 -0.694846  1.166846  0.710166
2013-01-02  0.447195  0.923124 -0.644863  0.583534
2013-01-03 -1.093509 -1.516493 -1.063032 -0.623707
2013-01-04  0.296014  0.121483 -1.085829 -1.144755
2013-01-05 -2.140108  1.364389  0.710251 -0.757226
2013-01-06  1.231026 -2.015137  1.036093 -0.027952
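
To connect this back to the question about DataFrames and Series, the sketch below uses a small hand-written frame (the values are made up) to show that each column of a DataFrame is a Series sharing the frame's index, and that a DataFrame can be assembled from such Series.

```python
import pandas as pd

# A small, hand-written frame (values chosen only for illustration).
dates = pd.date_range('20130101', periods=3)
df = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [10.0, 20.0, 30.0]}, index=dates)

col = df['A']
print(type(col).__name__)          # Series
print(col.index.equals(df.index))  # True: the column shares the frame's index

# A DataFrame can also be assembled from Series that share an index.
rebuilt = pd.DataFrame({'A': df['A'], 'B': df['B']})
print(rebuilt.equals(df))          # True
```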


## 2. Basic commands - Part 1.

Let's try some basic pandas commands on our DataFrame.

In [38]:
#First two lines
df.head(2)

                   A         B         C         D
2013-01-01 -1.262555 -0.694846  1.166846  0.710166
2013-01-02  0.447195  0.923124 -0.644863  0.583534


In [39]:
#Last two lines
df.tail(2)

                   A         B         C         D
2013-01-05 -2.140108  1.364389  0.710251 -0.757226
2013-01-06  1.231026 -2.015137  1.036093 -0.027952


In [40]:
#Index
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [41]:
#Conversion to numpy array
df.to_numpy()

array([[-1.26255498, -0.69484625,  1.16684579,  0.71016614],
       [ 0.44719522,  0.92312394, -0.64486311,  0.58353442],
       [-1.09350873, -1.51649313, -1.0630322 , -0.62370727],
       [ 0.29601396,  0.12148275, -1.08582877, -1.14475499],
       [-2.14010844,  1.36438889,  0.71025066, -0.75722588],
       [ 1.23102599, -2.01513726,  1.03609326, -0.02795175]])
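
Note that a NumPy array has a single dtype for the whole array, whereas a DataFrame has one dtype per column. The sketch below, using two small made-up frames, shows what `to_numpy()` returns in each case.

```python
import pandas as pd

# All-float columns: the result is a float64 array.
floats = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]})
print(floats.to_numpy().dtype)   # float64

# Mixed column types: NumPy needs one dtype for the whole array,
# so pandas falls back to the generic `object` dtype.
mixed = pd.DataFrame({'A': [1.0, 2.0], 'B': ['x', 'y']})
print(mixed.to_numpy().dtype)    # object

# The index and column labels are not part of the returned array.
print(floats.to_numpy().shape)   # (2, 2)
```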

#### Difference between NumPy and Pandas

NumPy is a more basic library that provides the building blocks of array-based data manipulation. Pandas builds on it to provide a streamlined, labeled way to use this functionality.

They are not alternatives, as many people seem to think, and I highly recommend using both. NumPy underlies most data work in Python, and plain arrays often run faster than pandas DataFrames for purely numerical operations. Also, many packages, such as TensorFlow for machine learning, are built around arrays rather than pandas objects.
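
One way to use the two libraries together is sketched below: drop to a NumPy array for the raw computation, then wrap the result back in a DataFrame to recover the labels. The frame and its values are invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 4.0, 9.0], 'B': [16.0, 25.0, 36.0]},
                  index=['x', 'y', 'z'])

# Drop to NumPy for the raw computation...
values = df.to_numpy()
roots = np.sqrt(values)

# ...then wrap the result back in a DataFrame to recover the labels.
result = pd.DataFrame(roots, index=df.index, columns=df.columns)
print(result.loc['y', 'B'])   # 5.0
```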

#### Basic statistics

In [42]:
df.describe()

              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean  -0.420323 -0.302914  0.019911 -0.209990
std    1.273803  1.343226  1.064143  0.755503
min   -2.140108 -2.015137 -1.085829 -1.144755
25%   -1.220293 -1.311081 -0.958490 -0.723846
50%   -0.398747 -0.286682  0.032694 -0.325830
75%    0.409400  0.722714  0.954633  0.430663
max    1.231026  1.364389  1.166846  0.710166
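
Each row of the `describe()` output corresponds to a method you can also call directly. A minimal sketch on a made-up single-column frame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0]})
summary = df.describe()

# The rows of describe() are labeled, so they can be read with .loc.
print(summary.loc['count', 'A'])   # 4.0
print(summary.loc['mean', 'A'])    # 2.5
print(summary.loc['25%', 'A'])     # 1.75
print(summary.loc['50%', 'A'])     # 2.5 (the median)

# The same numbers come from the individual methods.
print(df['A'].mean())              # 2.5
print(df['A'].quantile(0.25))      # 1.75
```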


In [43]:
#transposing your data
df.T

   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A   -1.262555    0.447195   -1.093509    0.296014   -2.140108    1.231026
B   -0.694846    0.923124   -1.516493    0.121483    1.364389   -2.015137
C    1.166846   -0.644863   -1.063032   -1.085829    0.710251    1.036093
D    0.710166    0.583534   -0.623707   -1.144755   -0.757226   -0.027952


## 3. Basic commands - Part 2.

In [44]:
#select one column
df['A']

2013-01-01   -1.262555
2013-01-02    0.447195
2013-01-03   -1.093509
2013-01-04    0.296014
2013-01-05   -2.140108
2013-01-06    1.231026
Freq: D, Name: A, dtype: float64

In [45]:
#Slicing
df[0:3]

                   A         B         C         D
2013-01-01 -1.262555 -0.694846  1.166846  0.710166
2013-01-02  0.447195  0.923124 -0.644863  0.583534
2013-01-03 -1.093509 -1.516493 -1.063032 -0.623707
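
The slice above selects rows by position. As a hedged sketch of how this differs from slicing by label, the example below uses a small frame with made-up integer values: a positional slice excludes its end, while a label slice with `.loc` includes both endpoints.

```python
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame({'A': range(6)}, index=dates)

# Integer slices select rows by position; the end is excluded.
by_position = df[0:3]
print(len(by_position))   # 3 rows: 2013-01-01 to 2013-01-03

# Label slices with .loc select rows by index value; both ends are included.
by_label = df.loc['2013-01-02':'2013-01-04']
print(len(by_label))      # 3 rows: 2013-01-02 to 2013-01-04
```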


#### Selection by label

In [46]:
#For getting a cross section using a label:
df.loc[dates[0]]

A   -1.262555
B   -0.694846
C    1.166846
D    0.710166
Name: 2013-01-01 00:00:00, dtype: float64

In [47]:
#Selecting on a multi-axis by label:
df.loc[:, ['A', 'B']]

                   A         B
2013-01-01 -1.262555 -0.694846
2013-01-02  0.447195  0.923124
2013-01-03 -1.093509 -1.516493
2013-01-04  0.296014  0.121483
2013-01-05 -2.140108  1.364389
2013-01-06  1.231026 -2.015137
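
Label-based selection can also restrict both axes at once, or return a single value. A minimal sketch on a frame with made-up integer values:

```python
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame({'A': range(6), 'B': range(10, 16), 'C': range(20, 26)},
                  index=dates)

# Rows by label slice (both endpoints included) and columns by label list.
sub = df.loc['2013-01-02':'2013-01-04', ['A', 'B']]
print(sub.shape)                 # (3, 2)

# A single row label plus a single column label returns a scalar.
print(df.loc[dates[0], 'A'])     # 0

# .at is the accessor intended for exactly this scalar case.
print(df.at[dates[0], 'A'])      # 0
```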


#### Selection by position

In [52]:
#Select via the position of the passed integers
print(df.iloc[3])
print()
print(df.iloc[4])
print()
print(df.iloc[0])

A    0.296014
B    0.121483
C   -1.085829
D   -1.144755
Name: 2013-01-04 00:00:00, dtype: float64

A   -2.140108
B    1.364389
C    0.710251
D   -0.757226
Name: 2013-01-05 00:00:00, dtype: float64

A   -1.262555
B   -0.694846
C    1.166846
D    0.710166
Name: 2013-01-01 00:00:00, dtype: float64
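
`iloc` also accepts integer slices and lists of positions on both axes. The sketch below uses a frame with made-up integer values so the results are easy to check by hand.

```python
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame({'A': range(6), 'B': range(10, 16), 'C': range(20, 26)},
                  index=dates)

# Integer slices behave like Python/NumPy slices: the end is excluded.
block = df.iloc[3:5, 0:2]
print(block.shape)            # (2, 2)

# Lists of positions select specific rows and columns.
picked = df.iloc[[1, 2, 4], [0, 2]]
print(list(picked.columns))   # ['A', 'C']

# A single row and column position returns a scalar;
# .iat is the scalar-only form of the same lookup.
print(df.iloc[1, 1])          # 11
print(df.iat[1, 1])           # 11
```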
