# Pandas

In [None]:
Introduction to Pandas
Exploring Pandas Series
Introduction to Pandas DataFrame
Implementing Basic DataFrame Functionalities
Importing & Exporting Data

### Introduction to Pandas

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.

In [None]:
1. A fast and efficient DataFrame object for data manipulation.
2. Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, 
    Microsoft Excel, SQL databases.
3. Intelligent data alignment and integrated handling of missing data: easily manipulate messy data into an orderly form.
4. Flexible reshaping and pivoting of data sets.
5. Intelligent label-based slicing, fancy indexing, and subsetting of large data sets.
6. Columns can be inserted and deleted from data structures for size mutability.

### Features

In [None]:
1. Data Wrangling
2. Modeling
3. Visualization
4. High Performance Computing
5. Text Processing
6. Statistical Computing
7. Numerical Computing

### Exploring Pandas Series

In [None]:
Creating a Pandas Series from a NumPy Array & Dictionary

In [1]:
import numpy as np
import pandas as pd

In [5]:
print(pd.__version__)

1.4.2


In [7]:
data=np.random.randint(10,30,10)
data

array([29, 16, 11, 23, 19, 12, 27, 27, 24, 14])

In [112]:
type(data)

numpy.ndarray

In [13]:
# Creating Series
series=pd.Series(data)
series

0    29
1    16
2    11
3    23
4    19
5    12
6    27
7    27
8    24
9    14
dtype: int32

In [4]:
type(series)

pandas.core.series.Series
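
The cells above cover only the NumPy array case. As a minimal sketch of the dictionary case, the example below builds a Series from a plain dict; the keys and values in `scores` are made-up sample data, not taken from this notebook.

```python
import pandas as pd

# Hypothetical sample data: the dict keys become the index labels.
scores = {'x': 10, 'y': 20, 'z': 30}
series_from_dict = pd.Series(scores)

print(series_from_dict)

# Values can then be looked up by label instead of by integer position.
print(series_from_dict['y'])
```

Unlike the array case, the index here is made of the string keys rather than the default 0, 1, 2, ... range.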

### Operations in Pandas Series

In [14]:
# Look up the value stored at index label 4
series[4]

19

In [15]:
# Boolean indexing: keep only the values greater than 20
series[series>20]

0    29
3    23
6    27
7    27
8    24
dtype: int32
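
With the default 0..n-1 index, a label and a position are the same number, which hides the difference between the two. The sketch below uses a small Series with a made-up, non-default index (an assumption for illustration) to show label-based access with `.loc`, position-based access with `.iloc`, and the boolean mask that drives filtering.

```python
import pandas as pd

# Hypothetical Series whose index labels differ from the positions.
s = pd.Series([29, 16, 11], index=[10, 20, 30])

by_label = s.loc[20]      # value whose index label is 20
by_position = s.iloc[0]   # value at the first position

# A comparison yields a boolean Series; indexing with it keeps the True rows.
mask = s > 15
filtered = s[mask]

print(by_label, by_position)
print(mask)
print(filtered)
```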

### Introduction to Pandas DataFrame

Creating a DataFrame from a Dictionary of Pandas Series

In [154]:
data={'a':pd.Series(np.random.randint(10,20,5)),
      'b':pd.Series(np.random.randint(5,8,5)),
      'c':pd.Series(5,index=[0,1,2])}
data

{'a': 0    10
 1    14
 2    16
 3    19
 4    10
 dtype: int32,
 'b': 0    7
 1    5
 2    5
 3    7
 4    7
 dtype: int32,
 'c': 0    5
 1    5
 2    5
 dtype: int64}

In [155]:
type(data)

dict

In [156]:
# Converting a dictionary of Pandas Series to a DataFrame
df=pd.DataFrame(data)
df

Unnamed: 0,a,b,c
0,10,7,5.0
1,14,5,5.0
2,16,5,5.0
3,19,7,
4,10,7,
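
The empty cells in column `c` come from index alignment: the Series are matched on their index labels, and labels missing from a Series are filled with NaN. A minimal sketch with fixed values in place of the random ones (so the result is reproducible) makes this visible, including the switch of `c` to a float dtype.

```python
import pandas as pd

# Fixed stand-ins for the random Series used above.
full = pd.Series([10, 14, 16, 19, 10])   # labels 0..4
short = pd.Series(5, index=[0, 1, 2])    # labels 0..2 only

frame = pd.DataFrame({'a': full, 'c': short})

# 'c' has no values for labels 3 and 4, so they become NaN,
# and the column is stored as float to be able to hold NaN.
print(frame)
print(frame.dtypes)
print(frame['c'].isna().sum())
```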


In [2]:
data2 = [{'p': 2, 'q': 4}, {'p': 5, 'q': 10, 'r': 15}]

In [3]:
pd.DataFrame(data2)

Unnamed: 0,p,q,r
0,2,4,
1,5,10,15.0


In [4]:
pd.DataFrame(data2, index=['IIT', 'Academy'])

Unnamed: 0,p,q,r
IIT,2,4,
Academy,5,10,15.0


In [111]:
pd.DataFrame(data2, columns=['p', 'q'])

Unnamed: 0,p,q
0,2,4
1,5,10


### Basic Functionality / Operations on DataFrame

In [24]:
# Extracting a single column from a DataFrame
n=df['a'] # a single column label returns a Series
n

0    16
1    19
2    17
3    14
4    12
Name: a, dtype: int32

In [116]:
n1=df[['a','b']]
n1

Unnamed: 0,a,b
1,4,7
2,5,8
3,6,9


In [117]:
type(n1)

pandas.core.frame.DataFrame

In [25]:
type(n)

pandas.core.series.Series

In [27]:
# Extracting two or more columns from a DataFrame
p=df[['a','b','c']] # a list of column labels returns a DataFrame
p

Unnamed: 0,a,b,c
0,16,7,5.0
1,19,5,5.0
2,17,6,5.0
3,14,5,
4,12,7,


In [28]:
type(p)

pandas.core.frame.DataFrame

In [29]:
df=pd.DataFrame(data)
df

Unnamed: 0,a,b,c
0,16,7,5.0
1,19,5,5.0
2,17,6,5.0
3,14,5,
4,12,7,


In [31]:
# Extract a whole row from the DataFrame by its row index label (returned as a Series)
df.loc[2]

a    17.0
b     6.0
c     5.0
Name: 2, dtype: float64

In [34]:
# Extract a single element using a row index label and a column label
df.loc[2,'b']

6

In [36]:
# Slice rows and columns by label. With .loc the end label is included.
df.loc[2:3,'b':'c']

Unnamed: 0,b,c
2,6,5.0
3,5,


In [38]:
# Select all rows from index label 2 onward, with every column
df.loc[2:,]

Unnamed: 0,a,b,c
2,17,6,5.0
3,14,5,
4,12,7,
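
Label-based slices with `.loc` include the end label, while position-based slices with `.iloc` follow the usual Python rule and exclude the end. The sketch below contrasts the two on a small frame that reuses the values shown above; the name `demo` is introduced here only for illustration.

```python
import pandas as pd

# Small stand-in frame with the default 0..4 index.
demo = pd.DataFrame({'a': [16, 19, 17, 14, 12],
                     'b': [7, 5, 6, 5, 7]})

by_label = demo.loc[2:3]      # labels 2 and 3 (end label included)
by_position = demo.iloc[2:3]  # position 2 only (end position excluded)

print(by_label)
print(by_position)
```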


In [119]:
df = pd.DataFrame(
[[4, 7, 10],
[5, 8, 11],
[6, 9, 12]],
index=[13, 14, 15],
columns=['a', 'b', 'c'])
df

Unnamed: 0,a,b,c
13,4,7,10
14,5,8,11
15,6,9,12


In [121]:
df = pd.DataFrame(
[[4, 7, 10],
[5, 8, 11],
[6, 9, 12]],
index=[13, 14, 15])
df

Unnamed: 0,0,1,2
13,4,7,10
14,5,8,11
15,6,9,12


In [61]:
type(df)

pandas.core.frame.DataFrame

In [131]:
df1 = pd.DataFrame(
{"a" : [1 ,2, 3,7],
"b" : [4, 5, 6,8],
"c" : [7, 8, 9,9],
"d" : [1, 7, 9,0],
"e" : [8, 9, 9,1]
},
index = [10, 2, 3,4])
df1

Unnamed: 0,a,b,c,d,e
10,1,4,7,1,8
2,2,5,8,7,9
3,3,6,9,9,9
4,7,8,9,0,1


In [133]:
# Renaming a column: rename() returns a new DataFrame and leaves df1 unchanged
df2=df1.rename(columns = {'c':'c1'})
df2

Unnamed: 0,a,b,c1,d,e
10,1,4,7,1,8
2,2,5,8,7,9
3,3,6,9,9,9
4,7,8,9,0,1


In [134]:
df1

Unnamed: 0,a,b,c,d,e
10,1,4,7,1,8
2,2,5,8,7,9
3,3,6,9,9,9
4,7,8,9,0,1


In [135]:
# With inplace=True the rename is applied to df1 itself and nothing is returned
df1.rename(columns = {'c':'c1'}, inplace=True)

In [142]:
df1

Unnamed: 0,a,b,c1,d,e
10,1,4,7,1,8
2,2,5,8,7,9
3,3,6,9,9,9
4,7,8,9,0,1


In [139]:
# Order rows by the values of a column (high to low).
df1.sort_values('c1',ascending=False)

Unnamed: 0,a,b,c1,d,e
3,3,6,9,9,9
4,7,8,9,0,1
2,2,5,8,7,9
10,1,4,7,1,8


In [140]:
# Select the column at position 2 with a slice (first column is 0; the end position 3 is excluded).
df1.iloc[:,2:3]

Unnamed: 0,c1
10,7
2,8
3,9
4,9


In [99]:
# Select columns in positions 1, 2 and 3 (first column is 0).
df1.iloc[:,[1,2,3]]

Unnamed: 0,b,c1,d
1,4,7,1
2,5,8,7
3,6,9,9
4,8,9,0


In [144]:
# Select rows meeting a logical condition, keeping only the listed columns.
df1.loc[df1['a'] > 5, ['a','c1']]

Unnamed: 0,a,c1
4,7,9
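
Several conditions can be combined in the row selector. As a hedged sketch, the example below rebuilds part of `df1` and combines two conditions with `&`; each condition needs its own parentheses because `&` binds more tightly than the comparison operators.

```python
import pandas as pd

# Three of the df1 columns from above, after the rename to 'c1'.
df1 = pd.DataFrame(
    {"a": [1, 2, 3, 7],
     "b": [4, 5, 6, 8],
     "c1": [7, 8, 9, 9]},
    index=[10, 2, 3, 4])

# Rows where a > 1 AND c1 == 9, keeping only columns 'a' and 'c1'.
subset = df1.loc[(df1['a'] > 1) & (df1['c1'] == 9), ['a', 'c1']]
print(subset)
```

Use `|` for "or" and `~` for "not" in the same way.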


### Statistical Operations on Pandas Dataframe

In [145]:
df1

Unnamed: 0,a,b,c1,d,e
10,1,4,7,1,8
2,2,5,8,7,9
3,3,6,9,9,9
4,7,8,9,0,1


In [147]:
# Find the mean of each column (the default, axis=0)
df1.mean()

a     3.25
b     5.75
c1    8.25
d     4.25
e     6.75
dtype: float64

In [148]:
# Find the mean of each row (axis=1)
df1.mean(axis=1)

10    4.2
2     6.2
3     7.2
4     5.0
dtype: float64

In [149]:
# Find sum Columnwise
df1.sum()

a     13
b     23
c1    33
d     17
e     27
dtype: int64

In [150]:
# Find the sum of each row
df1.sum(axis=1)

10    21
2     31
3     36
4     25
dtype: int64

### Find Missing Values in DataFrame

In [157]:
df.isnull()

Unnamed: 0,a,b,c
0,False,False,False
1,False,False,False
2,False,False,False
3,False,False,True
4,False,False,True


In [158]:
df

Unnamed: 0,a,b,c
0,10,7,5.0
1,14,5,5.0
2,16,5,5.0
3,19,7,
4,10,7,


In [159]:
df.isnull().sum()

a    0
b    0
c    2
dtype: int64

In [49]:
# Drop columns with Missing Values
df.dropna(axis=1) 

Unnamed: 0,a,b
0,16,7
1,19,5
2,17,6
3,14,5
4,12,7


In [50]:
# Drop rows with Missing Values
df.dropna(axis=0)

Unnamed: 0,a,b,c
0,16,7,5.0
1,19,5,5.0
2,17,6,5.0


In [51]:
# Fill NA values with 0 --> returns a new DataFrame; the original df is not modified
df.fillna(0)

Unnamed: 0,a,b,c
0,16,7,5.0
1,19,5,5.0
2,17,6,5.0
3,14,5,0.0
4,12,7,0.0


In [52]:
# Forward fill: copy the last valid value downward --> returns a new DataFrame
# (newer pandas releases deprecate the method= argument in favour of df.ffill())
df.fillna(method='ffill')

Unnamed: 0,a,b,c
0,16,7,5.0
1,19,5,5.0
2,17,6,5.0
3,14,5,5.0
4,12,7,5.0


In [53]:
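# Backward fill: copy the next valid value upward --> returns a new DataFrame
# The trailing NaNs stay because no later valid value exists (df.bfill() is the newer spelling)
df.fillna(method='bfill')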

Unnamed: 0,a,b,c
0,16,7,5.0
1,19,5,5.0
2,17,6,5.0
3,14,5,
4,12,7,


In [54]:
# Fill NA values in each column with that column's mean
df.fillna(df.mean())

Unnamed: 0,a,b,c
0,16,7,5.0
1,19,5,5.0
2,17,6,5.0
3,14,5,5.0
4,12,7,5.0
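
None of the fill calls above change `df` unless the result is assigned back. A minimal sketch with a small made-up frame shows that `fillna` returns a new object, and uses `ffill()` as the forward-fill spelling that also works in recent pandas releases.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing values in column 'c'.
df = pd.DataFrame({'a': [16, 19, 17],
                   'c': [5.0, np.nan, np.nan]})

filled = df.fillna(0)    # new DataFrame with NaN replaced by 0
forward = df.ffill()     # new DataFrame with the last valid value copied down

# The original still contains its NaN values.
print(df['c'].isna().sum())
print(filled)
print(forward)

# To keep a result, assign it back, e.g. df = df.fillna(0)
```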


### Importing & Exporting Data

Reading an Excel File

In [161]:
data=pd.read_excel('iris.xlsx')
data

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


In [163]:
data.head()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [164]:
data.tail(10)

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
140,6.7,3.1,5.6,2.4,virginica
141,6.9,3.1,5.1,2.3,virginica
142,5.8,2.7,5.1,1.9,virginica
143,6.8,3.2,5.9,2.3,virginica
144,6.7,3.3,5.7,2.5,virginica
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
149,5.9,3.0,5.1,1.8,virginica


Exporting the DataFrame as an Excel file

In [165]:
data.to_excel('irisnew.xlsx')
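
By default the row index is written out as an extra first column, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below illustrates this with CSV and an in-memory buffer instead of Excel, so it runs without an Excel engine or a file on disk; the two-row `sample` frame is made-up data, and `to_excel` accepts the same `index=False` argument.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the iris data.
sample = pd.DataFrame({'Sepal.Length': [5.1, 4.9],
                       'Species': ['setosa', 'setosa']})

# Default export: the index is written as an unnamed first column.
with_index_buffer = io.StringIO()
sample.to_csv(with_index_buffer)
with_index_buffer.seek(0)
with_index = pd.read_csv(with_index_buffer)
print(with_index.columns.tolist())

# index=False leaves the index out, so the round trip keeps the columns as they were.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.columns.tolist())
```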