# Introduction to Pandas

pandas is a *fast, powerful, flexible*, and easy-to-use open source data analysis and manipulation tool,
built on top of the Python programming language.

In particular, it offers *data structures and operations for manipulating numerical tables and time series.* It is free software released under the three-clause BSD license. The name is derived from "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals.

## Installing Pandas

There are multiple ways to install pandas. Instructions for installing from PyPI, from source, with ActivePython, through various Linux distributions, or as a development version are provided in the [official installation guide](https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html).

### Installing using pip
```bash
pip install pandas
```

For other installation methods, including conda, refer to the installation guide linked above.
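Once installation finishes, a quick way to confirm that it worked is to import the package and print its version. This is only a minimal check, not part of the official instructions, and the version printed depends on your environment.

```python
import pandas as pd

# If the import succeeds, pandas is installed; the exact version will vary.
print(pd.__version__)
```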

### Introduction to Pandas Data Structures

The two primary data structures of pandas, **Series (1-dimensional) and DataFrame (2-dimensional)**, handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users, DataFrame provides everything that R’s data.frame provides and much more. pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other third-party libraries.

The best way to think about the pandas data structures is as **flexible containers for lower dimensional data**. For example, DataFrame is a container for Series, and Series is a container for scalars. We would like to be able to insert and remove objects from these containers in a dictionary-like fashion.
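The dictionary-like behavior described above can be sketched as follows. This is a minimal illustration; the names `prices` and `table` and all values are made up for the example.

```python
import pandas as pd

# A Series holds scalars under labels, much like a dict.
prices = pd.Series({'apple': 1.5, 'pear': 2.0})
prices['plum'] = 3.25          # insert a scalar under a new label
del prices['pear']             # remove a scalar by label

# A DataFrame holds Series (its columns) under column names.
table = pd.DataFrame({'price': prices})
table['qty'] = pd.Series({'apple': 10, 'plum': 4})   # insert a column
removed = table.pop('price')   # remove a column and get it back as a Series

print(list(prices.index))      # ['apple', 'plum']
print(list(table.columns))     # ['qty']
```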

### Sources
[Pandas Official](https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html)

### Importing Package

In [3]:
import pandas as pd
import numpy as np

### Series
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

In [5]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a   -1.227954
b    0.260531
c    1.343155
d    0.225123
e   -0.458618
dtype: float64

Creating from dictionary

In [7]:
d = {'b': 1, 'a': 0, 'c': 2}

pd.Series(d)

b    1
a    0
c    2
dtype: int64

Creating from a scalar value

In [8]:
pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

As with a NumPy array, values can be accessed by integer position. Use `.iloc` for this, because plain square brackets with an integer (`s[0]`) on a label-indexed Series are deprecated in recent pandas versions:

In [9]:
d = {'b': 1, 'a': 0, 'c': 2}

s = pd.Series(d)

s.iloc[0]

1

You can get and set values by index label

In [10]:
s['a']

0

In [11]:
s['b'] = 5

In [12]:
s

b    5
a    0
c    2
dtype: int64
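Because a Series can be addressed both by label and by position, it can help to make the intent explicit. The sketch below uses `.loc` for labels and `.iloc` for integer positions; the Series name `t` is chosen only for this example.

```python
import pandas as pd

t = pd.Series({'b': 1, 'a': 0, 'c': 2})

first_by_position = t.iloc[0]   # first element, whatever its label
value_by_label = t.loc['a']     # element whose label is 'a'

t.loc['b'] = 5                  # set by label
t.iloc[2] = 7                   # set by position (the element labeled 'c')

print(first_by_position, value_by_label)   # 1 0
print(t.to_dict())                         # {'b': 5, 'a': 0, 'c': 7}
```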

When working with raw NumPy arrays, looping through the data value by value is usually not necessary. The same is true when working with Series in pandas:

In [13]:
s + s

b    10
a     0
c     4
dtype: int64

In [14]:
s*2

b    10
a     0
c     4
dtype: int64

In [15]:
np.sqrt(s)

b    2.236068
a    0.000000
c    1.414214
dtype: float64
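To make the comparison with looping concrete, the sketch below computes the same doubled Series twice, once element by element and once with a single vectorized expression. The names `s2`, `looped`, and `vectorized` are illustrative only.

```python
import pandas as pd

s2 = pd.Series({'b': 5, 'a': 0, 'c': 2})

# Explicit loop: build the doubled values one element at a time.
looped = pd.Series({label: value * 2 for label, value in s2.items()})

# Vectorized: one expression over the whole Series.
vectorized = s2 * 2

print((looped == vectorized).all())   # True
```

Both forms give the same result here; the vectorized form is shorter and avoids the Python-level loop.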

### DataFrame
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

Creating from Dicts

In [16]:
d = {'one': [1., 2., 3., 4.],
   'two': [4., 3., 2., 1.]}

df = pd.DataFrame(d)
df

|   | one | two |
|---|---|---|
| 0 | 1.0 | 4.0 |
| 1 | 2.0 | 3.0 |
| 2 | 3.0 | 2.0 |
| 3 | 4.0 | 1.0 |


Adding An Index

In [17]:
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

|   | one | two |
|---|---|---|
| a | 1.0 | 4.0 |
| b | 2.0 | 3.0 |
| c | 3.0 | 2.0 |
| d | 4.0 | 1.0 |


Creating from Arrays

In [22]:
d = [[1, 2, 3, 4], [5, 6, 7, 8]]
df = pd.DataFrame(d, columns=['a', 'b', 'c', 'd'], index=['x', 'y'])
df

|   | a | b | c | d |
|---|---|---|---|---|
| x | 1 | 2 | 3 | 4 |
| y | 5 | 6 | 7 | 8 |
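Since a DataFrame can be thought of as a dict of Series, it can also be built from one. The following is a small sketch with made-up column names and values. When the Series have different indexes, the resulting row index is their union and missing cells are filled with NaN.

```python
import pandas as pd

columns = {
    'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd']),
}

# The row index is the union of the Series indexes;
# 'one' has no value for label 'd', so that cell is NaN.
frame = pd.DataFrame(columns)
print(frame)
```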


### DataFrame Operations

Getting, setting, and deleting columns works with the same syntax as the analogous dict operations:

In [23]:
df['a']

x    1
y    5
Name: a, dtype: int64

Accessing a row by index label

In [26]:
df.loc['x']

a    1
b    2
c    3
d    4
Name: x, dtype: int64

Creating new column

In [29]:
df['three'] = df['a'] * df['b']
df

|   | a | b | c | d | three |
|---|---|---|---|---|---|
| x | 1 | 2 | 3 | 4 | 2 |
| y | 5 | 6 | 7 | 8 | 30 |


Deleting columns

In [30]:
del df['a']
df.pop('b')
df

|   | c | d | three |
|---|---|---|---|
| x | 3 | 4 | 2 |
| y | 7 | 8 | 30 |
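The two deletion forms used above differ in what they return, and both modify the DataFrame in place. The sketch below contrasts them with `drop()`, which returns a new DataFrame and leaves the original untouched. The variable names `frame`, `popped`, and `trimmed` are illustrative.

```python
import pandas as pd

frame = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]],
                     columns=['a', 'b', 'c', 'd'], index=['x', 'y'])

del frame['a']                       # removes the column in place, returns nothing
popped = frame.pop('b')              # removes the column in place and returns it
trimmed = frame.drop(columns=['c'])  # returns a new frame; `frame` is unchanged

print(list(frame.columns))    # ['c', 'd']
print(list(trimmed.columns))  # ['d']
print(popped.tolist())        # [2, 6]
```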


### Indexing and Selection


|Operation | Syntax | Result|
|---|---|---|
|Select column| df[col]| Series|
|Select row by label|df.loc[label]|Series|
|Select row by integer location|df.iloc[loc]|Series|
|Slice rows|df[5:10]|DataFrame|
|Select rows by boolean vector|df[bool_vec]|DataFrame|

In [31]:
df['c']

x    3
y    7
Name: c, dtype: int64

In [32]:
df.loc['y']

c         7
d         8
three    30
Name: y, dtype: int64

In [33]:
df.iloc[1]

c         7
d         8
three    30
Name: y, dtype: int64

In [35]:
df[0:1]

|   | c | d | three |
|---|---|---|---|
| x | 3 | 4 | 2 |


In [38]:
df[df > 2]

|   | c | d | three |
|---|---|---|---|
| x | 3 | 4 | NaN |
| y | 7 | 8 | 30.0 |
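Note that `df[df > 2]` masks individual cells rather than dropping rows. Selecting whole rows with a boolean vector, as listed in the table above, uses a one-dimensional boolean Series instead. The sketch below rebuilds the same three-column frame under the hypothetical name `frame` so that it runs on its own.

```python
import pandas as pd

frame = pd.DataFrame({'c': [3, 7], 'd': [4, 8], 'three': [2, 30]},
                     index=['x', 'y'])

mask = frame['three'] > 2    # boolean Series: x is False, y is True
selected = frame[mask]       # keeps only the rows where the mask is True

print(selected)
```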


Transpose of a dataframe

In [39]:
df.T

|   | x | y |
|---|---|---|
| c | 3 | 7 |
| d | 4 | 8 |
| three | 2 | 30 |


### Dataframe information

We can get information about the DataFrame's structure and data types using the `df.info()` method:

In [42]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, x to y
Data columns (total 3 columns):
c        2 non-null int64
d        2 non-null int64
three    2 non-null int64
dtypes: int64(3)
memory usage: 144.0+ bytes


We can get quick statistics about each column of the data frame

In [41]:
df.describe()

|   | c | d | three |
|---|---|---|---|
| count | 2.0 | 2.0 | 2.0 |
| mean | 5.0 | 6.0 | 16.0 |
| std | 2.828427 | 2.828427 | 19.79899 |
| min | 3.0 | 4.0 | 2.0 |
| 25% | 4.0 | 5.0 | 9.0 |
| 50% | 5.0 | 6.0 | 16.0 |
| 75% | 6.0 | 7.0 | 23.0 |
| max | 7.0 | 8.0 | 30.0 |


### Iterating a dataframe

Iterating over columns

In [44]:
for column_name in df:
    print(column_name)
    print('------\n')

c
------

d
------

three
------



If you want, you can iterate over both the column name and the column data at once. Use `df.items()` for this; the older `df.iteritems()` was removed in pandas 2.0.

In [48]:
for column_name, col_data in df.items():
    print(column_name)
    print(col_data)
    print("--------")

c
x    3
y    7
Name: c, dtype: int64
--------
d
x    4
y    8
Name: d, dtype: int64
--------
three
x     2
y    30
Name: three, dtype: int64
--------


Iterating over rows

In [49]:
for index, row in df.iterrows():
    print(index)
    print(row)
    print('---------')

x
c        3
d        4
three    2
Name: x, dtype: int64
---------
y
c         7
d         8
three    30
Name: y, dtype: int64
---------


## NOTE:

Iteration in pandas is an anti-pattern and something you should do only when you have exhausted every other option. Avoid any function with "iter" in its name for more than a few thousand rows, or you will have to get used to a lot of waiting.

A better option than iterrows() is often the **apply()** method, which applies a function along a specific axis (either rows or columns) of a DataFrame. Although apply() still loops through rows internally, it is usually faster than iterrows() because it takes advantage of a number of internal optimizations, such as using iterators in Cython.

In [50]:
df.apply(lambda row: print(row), axis=1)

c        3
d        4
three    2
Name: x, dtype: int64
c         7
d         8
three    30
Name: y, dtype: int64


x    None
y    None
dtype: object
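The call above returns a Series of `None` because `print` returns nothing. More typically, the function passed to `apply()` returns a value for each row. The following sketch, using a standalone copy of the frame under the hypothetical name `frame`, adds two columns row by row.

```python
import pandas as pd

frame = pd.DataFrame({'c': [3, 7], 'd': [4, 8], 'three': [2, 30]},
                     index=['x', 'y'])

# axis=1 passes each row to the function as a Series.
row_totals = frame.apply(lambda row: row['c'] + row['d'], axis=1)

print(row_totals.to_dict())   # {'x': 7, 'y': 15}
```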

**Vectorization** is the process of executing operations on entire arrays.

pandas includes a generous collection of vectorized functions for everything from mathematical operations to aggregations and string functions (for an extensive list of available functions, check out the pandas docs).

The built-in functions are optimized to operate specifically on pandas Series and DataFrames. As a result, using vectorized pandas functions is almost always preferable to accomplishing similar ends with custom-written looping.