# Pandas
Pandas is an open-source Python library that provides high-performance tools for data manipulation and analysis.

Tutorial reference - https://www.tutorialspoint.com/python_pandas


### Pandas is the Excel of Python

![image](media/excel.jpg)

image reference - https://www.addictivetips.com/microsoft-office

### Pandas deals with the following data structures

- Series
- DataFrame

Data Structure | Dimensions | Description
--- | --- | ---
Series | 1 | 1D labeled homogeneous array, size-immutable.
DataFrame | 2 | General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns.

### Series

A Series is a one-dimensional, array-like structure holding homogeneous data. For example, the following Series is a collection of the integers 10, 23, 56, …

10 | 23	| 56 | 17 | 52	| 61	| 73	| 90	| 26	| 72

- Homogeneous data
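
As a minimal sketch, the ten integers above can be placed in a Series like this. The variable name `s` is only an illustration; the exact integer dtype printed may depend on your platform.

```python
import pandas as pd

# Build a Series from the ten integers shown in the row above.
# "s" is a hypothetical name chosen for this example.
s = pd.Series([10, 23, 56, 17, 52, 61, 73, 90, 26, 72])

print(s.dtype)    # one dtype shared by every element (homogeneous data)
print(len(s))     # number of elements
print(s.iloc[0])  # first element, selected by position
print(s.sum())    # sum of all elements
```

Because every element shares a single dtype, operations such as `sum()` run over the whole Series at once.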

### DataFrame
A DataFrame is a two-dimensional, table-like structure whose columns can hold different data types. For example:

Name | Age	| Gender	| Rating
--- | --- | --- | ---
Steve	| 32 | Male	| 3.45
Lia	| 28	| Female	| 4.6
Vin	| 45	| Male	| 3.9
Katie | 38	| Female	| 2.78

### Data Type of Columns
The data types of the four columns are as follows:

Column	| Type
--- | ---
Name | String
Age	| Integer
Gender	| String
Rating	| Float
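
To make the table concrete, here is a small sketch that builds the same four rows by hand and asks pandas for the column types. The name `people` is an assumption made for this example. Note that pandas usually reports string columns as `object` (or as a dedicated string dtype in newer versions) rather than "String".

```python
import pandas as pd

# Rebuild the example table from a dict of columns.
# "people" is a hypothetical name used only in this sketch.
people = pd.DataFrame({
    "Name": ["Steve", "Lia", "Vin", "Katie"],
    "Age": [32, 28, 45, 38],
    "Gender": ["Male", "Female", "Male", "Female"],
    "Rating": [3.45, 4.6, 3.9, 2.78],
})

print(people)         # the full table
print(people.dtypes)  # one dtype per column: heterogeneous across columns
```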


### Install pandas from notebook
Run the cell below to install pandas and NumPy.

In [None]:
%%sh
pip install pandas
pip install numpy

## Let's dive into handling data
We will import a CSV file (the kind we typically receive from a client or as part of a prepared problem).

#### Import pandas

You have to import a Python package before you can use it. Run the cell below to import pandas and the other necessary packages.

In [2]:
import pandas as pd
import numpy as np

### 1. Loading a CSV file

syntax:
- df = pd.read_csv(filepath, **kwargs)

In [56]:
data=pd.read_csv('data/titanic_train.csv')
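
The keyword arguments control how the file is parsed. The sketch below reads a tiny in-memory CSV instead of a file on disk, so it runs without `data/titanic_train.csv`. The sample text and the name `sample` are made up for illustration; `usecols` and `index_col` are two commonly used `read_csv` parameters.

```python
import io

import pandas as pd

# Hypothetical three-row CSV, kept in memory so no file is needed.
csv_text = """PassengerId,Survived,Sex,Age
1,0,male,22
2,1,female,38
3,1,female,
"""

sample = pd.read_csv(
    io.StringIO(csv_text),                   # any file path or file-like object
    usecols=["PassengerId", "Sex", "Age"],   # load only these columns
    index_col="PassengerId",                 # use this column as the row index
)

print(sample)
print(sample["Age"].isna().sum())  # the empty Age field is read as a missing value
```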

### 2. Quickly viewing a few records

syntax:
- df.head(n): the first n rows (n defaults to 5)
- df.tail(n): the last n rows (n defaults to 5)

In [4]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
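
A quick sketch of how the row count argument behaves, using a hypothetical ten-row frame called `numbers` rather than the Titanic data:

```python
import pandas as pd

# Hypothetical frame with the values 0..9 in a single column.
numbers = pd.DataFrame({"n": range(10)})

first_three = numbers.head(3)  # rows 0, 1, 2
last_two = numbers.tail(2)     # rows 8, 9
default_head = numbers.head()  # no argument: the first 5 rows

print(first_three)
print(last_two)
print(len(default_head))
```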


### 3. Quick data statistics

syntax:
- df.describe()

In [6]:
data.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292
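
Two things in the output above are worth noticing: only numeric columns appear, and the `count` for `Age` is lower than for the other columns because missing values are skipped. The following sketch reproduces both effects on a hypothetical four-row frame named `toy`.

```python
import pandas as pd

# Hypothetical data: one missing Age and one text column.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, None],
    "Sex": ["male", "female", "female", "male"],
})

summary = toy.describe()  # numeric columns only, by default

print(summary)
print(summary.loc["count", "Age"])  # counts non-missing values only
```

Passing `include="all"` to `describe()` also summarises the non-numeric columns.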


#### Extras

In [20]:
# 3.1 List all column names in the data
data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [23]:
# 3.2 Show the dtype and non-null count of each column
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
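
The non-null counts above show that `Age`, `Cabin` and `Embarked` have missing values. One possible way to count the missing entries directly is `isna().sum()`; the sketch below uses a hypothetical three-row frame named `toy`, not the real file.

```python
import pandas as pd

# Hypothetical data with gaps in both columns.
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0],
    "Cabin": [None, "C85", None],
})

# isna() marks each missing cell as True; summing counts them per column.
missing = toy.isna().sum()
print(missing)
```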


### Digging deeper into the data

1. Let's see what the survival rates of male and female passengers were.

In [62]:
# selecting only one column
data['Sex']

0        male
1      female
2      female
3      female
4        male
5        male
6        male
7        male
8      female
9      female
10     female
11     female
12       male
13       male
14     female
15     female
16       male
17       male
18     female
19     female
20       male
21       male
22     female
23       male
24     female
25     female
26       male
27       male
28     female
29       male
        ...  
861      male
862    female
863    female
864      male
865    female
866    female
867      male
868      male
869      male
870      male
871    female
872      male
873      male
874    female
875    female
876      male
877      male
878      male
879    female
880    female
881      male
882    female
883      male
884      male
885    female
886      male
887    female
888    female
889      male
890      male
Name: Sex, Length: 891, dtype: object
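
Selecting with a single column name returns a Series, while selecting with a list of names returns a DataFrame. A small sketch on a hypothetical frame named `toy`:

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Survived": [0, 1, 0, 1, 1],
})

one_column = toy["Sex"]                # single name -> Series
two_columns = toy[["Sex", "Survived"]]  # list of names -> DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(one_column.value_counts())       # how often each value occurs
```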

In [52]:
# Filter rows: keep only female passengers
data[data['Sex']=='female']

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
14,15,0,3,"Vestrom, Miss. Hulda Amanda Adolfina",female,14.0,0,0,350406,7.8542,,S
15,16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,0,0,248706,16.0000,,S
18,19,0,3,"Vander Planke, Mrs. Julius (Emelia Maria Vande...",female,31.0,1,0,345763,18.0000,,S


In [41]:
# Summary statistics for female passengers who survived
data[(data['Sex']=='female') & (data['Survived']==1)].describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,233.0,233.0,233.0,197.0,233.0,233.0,233.0
mean,429.699571,1.0,1.918455,28.847716,0.515021,0.515021,51.938573
std,255.048296,0.0,0.834211,14.175073,0.737533,0.820527,64.102256
min,2.0,1.0,1.0,0.75,0.0,0.0,7.225
25%,238.0,1.0,1.0,19.0,0.0,0.0,13.0
50%,400.0,1.0,2.0,28.0,0.0,0.0,26.0
75%,636.0,1.0,3.0,38.0,1.0,1.0,76.2917
max,888.0,1.0,3.0,63.0,4.0,5.0,512.3292
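
The filters above work through boolean masks: a comparison such as `data['Sex']=='female'` produces a Series of True/False values, and indexing with it keeps the True rows. The sketch below shows the idea on a hypothetical frame named `toy`. The parentheses around each condition are needed because `&` binds more tightly than `==`.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Survived": [0, 1, 0, 1, 1],
})

is_female = toy["Sex"] == "female"  # boolean Series, one value per row
women = toy[is_female]              # rows where the mask is True

# Combine conditions element-wise with & (and), | (or), ~ (not).
women_survivors = toy[is_female & (toy["Survived"] == 1)]

print(is_female.tolist())
print(women)
print(women_survivors)
```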


In [43]:
# Survival rate among women: 233 survivors out of 314 female passengers
233/314

0.7420382165605095
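
Rather than typing the two counts by hand, the same ratio can be computed from the filtered rows. This is a sketch on a hypothetical frame named `toy`; on the real data the two counts would be the 233 and 314 used above.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Survived": [0, 1, 0, 1, 1],
})

women = toy[toy["Sex"] == "female"]
n_women = len(women)                             # number of female rows
n_women_survived = int(women["Survived"].sum())  # Survived is 0/1, so sum = survivors

rate = n_women_survived / n_women
print(n_women_survived, n_women, rate)
```

Because `Survived` only contains 0 and 1, `women["Survived"].mean()` gives the same number in one step.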

In [53]:
# A more concise way: the survival rate (as a fraction) for each sex, using groupby
data[['Sex','Survived']].groupby('Sex').mean()

Unnamed: 0_level_0,Survived
Sex,Unnamed: 1_level_1
female,0.742038
male,0.188908
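
`groupby` splits the rows by the values of `Sex`, and `mean()` then averages `Survived` inside each group. The sketch below, on a hypothetical frame named `toy`, also uses `agg` to show the survivor count and group size next to the rate.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Survived": [0, 1, 0, 1, 1],
})

# Mean of a 0/1 column per group = fraction of survivors per group.
rates = toy.groupby("Sex")["Survived"].mean()
print(rates)

# Several aggregations at once: rate, number of survivors, group size.
details = toy.groupby("Sex")["Survived"].agg(["mean", "sum", "count"])
print(details)
```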


2. 