### Python Pandas
* An open-source Python library providing high-performance data manipulation and analysis tools built on its powerful data structures.
* Developed by Wes McKinney in 2008.
* Used in a wide range of academic and commercial fields, including finance, economics, statistics and analytics.


### Use the following pip command to install pandas
 > **pip install pandas**


---



#### Details
* Deals with three data structures.
1. > Series (one-dimensional array-like structure with homogeneous data)
2. > DataFrame (two-dimensional table whose columns can hold heterogeneous data)
3. > Panel (three-dimensional data structure with heterogeneous data; deprecated and removed from pandas as of version 0.25)
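
As a quick illustration of the dimensionality and dtype differences listed above, the following sketch builds a Series and a DataFrame. The variable names and values are made up for the example.

```python
import pandas as pd

# A Series is one-dimensional and carries a single dtype.
ratings = pd.Series([6.0, 7.0, 9.0])

# A DataFrame is two-dimensional; each column has its own dtype.
people = pd.DataFrame({"name": ["Jon", "Sansa", "Arya"], "rating": [6.0, 7.0, 9.0]})

print(ratings.ndim, ratings.dtype)   # number of dimensions and the single dtype
print(people.ndim)
print(people.dtypes)                 # one dtype per column
```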



#### ***a***. Series
Syntax: pandas.Series( data, index, dtype, copy)
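
The constructor parameters can be tried out in a short sketch. The values below are arbitrary: `dtype` requests a conversion of the data, and `copy=True` asks pandas to copy the input rather than reuse its memory.

```python
import numpy as np
import pandas as pd

values = np.array([1, 2, 3])

# data: the values, index: the labels, dtype: the requested element type,
# copy=True: copy the input so the Series does not share memory with `values`.
s = pd.Series(data=values, index=["x", "y", "z"], dtype="float64", copy=True)

s.loc["x"] = 100.0      # modify the Series only
print(s)
print(values)           # the source array is unchanged
```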


In [1]:
#Creating an empty Series (dtype given explicitly, since recent pandas versions no longer default to float64)
import pandas as pd
s = pd.Series(dtype='float64')
print (s)


Series([], dtype: float64)




In [2]:
#Creating a Series from ndarray
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print (s)

0    a
1    b
2    c
3    d
dtype: object


In [3]:
#Creating a Series from ndarray providing the index
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
print (s)

#The index argument replaces the default labels 0, 1, 2, ... with the ones supplied.

100    a
101    b
102    c
103    d
dtype: object


In [4]:
#Create a series from dict
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data,index=['b','c','d','a'])
print (s)
#Values are matched to the index by key, in index order; an index label with no matching key ('d' here) is filled with NaN (Not a Number). See the reading for better insight.

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64
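
Missing entries such as the `d` label above can be detected and replaced afterwards. A minimal sketch, reusing the same dict; the fill value 0.0 is an arbitrary choice for illustration.

```python
import pandas as pd

data = {'a': 0., 'b': 1., 'c': 2.}
s = pd.Series(data, index=['b', 'c', 'd', 'a'])

print(s.isna())         # True only for the label 'd'
print(s.fillna(0.0))    # returns a new Series with NaN replaced by 0.0
print(s.dropna())       # returns a new Series without the 'd' entry
```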


In [5]:
#Create a series from Scalar
import pandas as pd
import numpy as np
s = pd.Series(5, index=[0, 1, 2, 3])
print (s)

0    5
1    5
2    5
3    5
dtype: int64


In [6]:
#Selection in the series
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
s.loc[100]
# s.iloc[2]
# s[100]

'a'
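
The commented-out alternatives in the cell above differ in what they look up: `loc` uses the index label, while `iloc` uses the integer position. A small sketch with the same data:

```python
import numpy as np
import pandas as pd

data = np.array(['a', 'b', 'c', 'd'])
s = pd.Series(data, index=[100, 101, 102, 103])

by_label = s.loc[100]      # label 100 is the first element
by_position = s.iloc[2]    # position 2 is the third element
print(by_label, by_position)
```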

In [7]:
#Slicing the Series
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
s.iloc[0:3]  #positional slice: the first three elements

100    a
101    b
102    c
dtype: object
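
Slices behave differently depending on the accessor. The sketch below, on the same data, shows that a positional slice with `iloc` excludes the stop position, whereas a label slice with `loc` includes the end label.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.array(['a', 'b', 'c', 'd']), index=[100, 101, 102, 103])

positional = s.iloc[0:3]    # positions 0, 1 and 2 (stop position excluded)
labelled = s.loc[100:102]   # labels 100, 101 and 102 (end label included)

print(positional)
print(labelled)
```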

#### **b**. DataFrame
This is the data structure we will focus on.


**1. Creating an empty dataframe**

In [8]:
#Import the pandas library under the alias pd
import pandas as pd
df = pd.DataFrame()
print (df)

Empty DataFrame
Columns: []
Index: []


**2. Create a DataFrame from lists**

In [9]:
import pandas as pd
#list consists of numbers
data_second = [1,2,3,4,5]
df2 = pd.DataFrame(data_second)
print(df2)

   0
0  1
1  2
2  3
3  4
4  5


In [10]:
import pandas as pd
#list consists of characters
data_first = ['a','b','c','d','e']
df1 = pd.DataFrame(data_first)
print (df1)


   0
0  a
1  b
2  c
3  d
4  e


In [11]:
import pandas as pd
#list of list to get familiar with multiple columns 
data = [['Jon', 6],['Sansa', 7],['Arya', 9]]
df = pd.DataFrame(data)
print (df)

       0  1
0    Jon  6
1  Sansa  7
2   Arya  9


In [12]:
import pandas as pd
data = [['Jon', 6],['Sansa', 7],['Arya', 9]]
#Adding parameters to the constructor to label the data; only the numeric column is cast to float
df = pd.DataFrame(data,columns=['Name','Rating'], index = ['rank3', 'rank2', 'rank1']).astype({'Rating': float})
print (df)

        Name  Rating
rank3    Jon     6.0
rank2  Sansa     7.0
rank1   Arya     9.0
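
After constructing a DataFrame it can help to confirm its shape and column types. A short sketch using the same data as above, with the `Rating` column converted to float through `astype`:

```python
import pandas as pd

data = [['Jon', 6], ['Sansa', 7], ['Arya', 9]]
df = pd.DataFrame(data, columns=['Name', 'Rating'], index=['rank3', 'rank2', 'rank1'])
df['Rating'] = df['Rating'].astype(float)   # convert only the numeric column

print(df.shape)             # (rows, columns)
print(list(df.columns))
print(df['Rating'].dtype)
```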


**3. Column Operations**
* Selection
* Addition
* Deletion
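
Selection can be sketched on a stand-alone copy of the same data (the name `people` is only for this example). Single brackets with one column name return a Series; a list of names returns a DataFrame.

```python
import pandas as pd

people = pd.DataFrame(
    [['Jon', 6.0], ['Sansa', 7.0], ['Arya', 9.0]],
    columns=['Name', 'Rating'],
    index=['rank3', 'rank2', 'rank1'],
)

names = people['Name']                # one column -> Series
subset = people[['Name', 'Rating']]   # list of columns -> DataFrame
print(type(names).__name__, type(subset).__name__)
```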

In [13]:
#Here we reuse the DataFrame from the previous cell (pandas is already imported as pd)
print ("Adding a new column from a Series aligned on the index:")
df['Gender'] = pd.Series(['Male', 'Female', 'Female'], index = ['rank3', 'rank2', 'rank1'])
print(df)

Adding a new column from a Series aligned on the index:
        Name  Rating  Gender
rank3    Jon     6.0    Male
rank2  Sansa     7.0  Female
rank1   Arya     9.0  Female


In [14]:
print ("Deleting the 'Rating' column using the del statement:")
del df['Rating']
print(df)

Deleting the 'Rating' column using the del statement:
        Name  Gender
rank3    Jon    Male
rank2  Sansa  Female
rank1   Arya  Female


In [15]:
print ("Deleting another column using the pop method:")
df.pop('Gender')
print (df)

Deleting another column using the pop method:
        Name
rank3    Jon
rank2  Sansa
rank1   Arya
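
Both `del` and `pop` modify the DataFrame in place. A sketch of two related points, on made-up data: `pop` hands back the removed column, and `drop(columns=...)` returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Jon', 'Sansa'],
    'Rating': [6.0, 7.0],
    'Gender': ['Male', 'Female'],
})

without_rating = people.drop(columns=['Rating'])  # new DataFrame, `people` unchanged
gender = people.pop('Gender')                     # removes the column and returns it

print(list(without_rating.columns))
print(list(people.columns))
print(gender.tolist())
```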


**4. Row Operations**
* Selection
* Addition
* Deletion

In [16]:
import pandas as pd
data = [['Ross', 'Paleontologist'],['Racheal', 'Waitress'],['Chandler', 'Banker'],['Monica', 'Chef'], ['Joey', 'Actor' ], ['Phoebe', 'Massuse' ]]
df = pd.DataFrame(data,columns=['Name','Occupation'], index = ['a', 'b','c', 'd', 'e', 'f'])
print (df)

       Name      Occupation
a      Ross  Paleontologist
b   Racheal        Waitress
c  Chandler          Banker
d    Monica            Chef
e      Joey           Actor
f    Phoebe         Massuse


In [17]:
#Selection by integer position
df.iloc[1]

Name           Racheal
Occupation    Waitress
Name: b, dtype: object

In [18]:
#Selection by index label
df.loc['f']

Name           Phoebe
Occupation    Massuse
Name: f, dtype: object

In [19]:
#slicing rows
print(df[2:4])

       Name Occupation
c  Chandler     Banker
d    Monica       Chef


In [20]:
#In a notebook, print is optional: the value of the last expression in a cell is displayed automatically
df[2:4]

       Name Occupation
c  Chandler     Banker
d    Monica       Chef


In [21]:
#Addition of rows
df2 = pd.DataFrame([['Gunther', 'Waiter'],], columns = ['Name','Occupation'], index = ['g'])
df = pd.concat([df, df2])  #DataFrame.append was removed in pandas 2.0
df


       Name      Occupation
a      Ross  Paleontologist
b   Racheal        Waitress
c  Chandler          Banker
d    Monica            Chef
e      Joey           Actor
f    Phoebe         Massuse
g   Gunther          Waiter
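
Rows from several DataFrames can be combined with `pd.concat`. The sketch below uses two single-row frames (the names `cast` and `extra` are only for this example) and shows the `ignore_index` option, which discards the original labels and renumbers the result from 0.

```python
import pandas as pd

cast = pd.DataFrame([['Ross', 'Paleontologist']], columns=['Name', 'Occupation'], index=['a'])
extra = pd.DataFrame([['Gunther', 'Waiter']], columns=['Name', 'Occupation'], index=['g'])

kept = pd.concat([cast, extra])                           # keeps the labels 'a' and 'g'
renumbered = pd.concat([cast, extra], ignore_index=True)  # labels become 0 and 1

print(kept)
print(renumbered)
```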


In [22]:
#Deletion of rows (drop returns a new DataFrame; df itself is unchanged unless the result is assigned back)
df.drop('g')

       Name      Occupation
a      Ross  Paleontologist
b   Racheal        Waitress
c  Chandler          Banker
d    Monica            Chef
e      Joey           Actor
f    Phoebe         Massuse
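
Because `drop` returns a new DataFrame by default, the row only disappears from the original if the result is assigned back. A minimal sketch on made-up data:

```python
import pandas as pd

cast = pd.DataFrame({'Name': ['Ross', 'Gunther']}, index=['a', 'g'])

trimmed = cast.drop('g')      # new DataFrame without row 'g'
print(list(cast.index))       # the original still contains 'g'
print(list(trimmed.index))

cast = cast.drop('g')         # assign back to make the change stick
print(list(cast.index))
```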


#### Reading CSV and Excel files into a DataFrame using pandas

A comma-separated values file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format. A CSV file typically stores tabular data in plain text, in which case each line will have the same number of fields. The CSV file format is not fully standardized. 
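
To see the format in action without a file on disk, the sketch below parses a small semicolon-separated text held in a string. The column names and values are invented for the example, and `io.StringIO` stands in for a real file path.

```python
import io
import pandas as pd

raw = "name;score\nJon;6\nSansa;7\nArya;9\n"

# sep=';' tells read_csv that fields are separated by semicolons, not commas.
scores = pd.read_csv(io.StringIO(raw), sep=';')
print(scores)
```

The first line of the text becomes the column header, and each following line becomes one row.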

In [23]:
import pandas as pd
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/wine-quality(red).txt', sep = ';')
df


   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
0            7.4               0.7          0.0             1.9      0.076                   11                    34   0.9978  3.51       0.56      9.4        5
1            7.8              0.88          0.0             2.6      0.098                   25                    67   0.9968   3.2       0.68      9.8        5
2            7.8              0.76         0.04             2.3      0.092                   15                    54    0.997  3.26       0.65      9.8        5
3           11.2              0.28         0.56             1.9      0.075                   17                    60    0.998  3.16       0.58      9.8        6
4            7.4               0.7          0.0             1.9      0.076                   11                    34   0.9978  3.51       0.56      9.4        5
5            7.4              0.66          0.0             1.8      0.075                   13                    40   0.9978  3.51       0.56      9.4        5


In [24]:
df[['fixed acidity', 'volatile acidity']]

   fixed acidity  volatile acidity
0            7.4               0.7
1            7.8              0.88
2            7.8              0.76
3           11.2              0.28
4            7.4               0.7
5            7.4              0.66


In [25]:
df.loc[1:3, 'fixed acidity':'residual sugar' ]

   fixed acidity  volatile acidity  citric acid  residual sugar
1            7.8              0.88          0.0             2.6
2            7.8              0.76         0.04             2.3
3           11.2              0.28         0.56             1.9


In [26]:
df.iloc[0:3, 0:3]

   fixed acidity  volatile acidity  citric acid
0            7.4               0.7          0.0
1            7.8              0.88          0.0
2            7.8              0.76         0.04


In [27]:
df['fixed acidity']

0     7.4
1     7.8
2     7.8
3    11.2
4     7.4
5     7.4
Name: fixed acidity, dtype: float64

In [28]:
#Removing columns using drop (axis = 1 selects columns; inplace = True modifies df directly)
df.drop(['density', 'pH'], axis = 1, inplace = True)
df


   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  sulphates  alcohol  quality
0            7.4               0.7          0.0             1.9      0.076                   11                    34       0.56      9.4        5
1            7.8              0.88          0.0             2.6      0.098                   25                    67       0.68      9.8        5
2            7.8              0.76         0.04             2.3      0.092                   15                    54       0.65      9.8        5
3           11.2              0.28         0.56             1.9      0.075                   17                    60       0.58      9.8        6
4            7.4               0.7          0.0             1.9      0.076                   11                    34       0.56      9.4        5
5            7.4              0.66          0.0             1.8      0.075                   13                    40       0.56      9.4        5


To work with Excel files in pandas, please go through the reading section.


In [29]:
#Add a column to a DataFrame (based on another column, a list or a default value)
#Here a Series is used: values are aligned on the index, and labels without a value get NaN
df['density'] = pd.Series({ 1 : 3.2, 3 : 3.1, 4 : 3.6}, index = [0,1,2,3,4,5])
df

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  sulphates  alcohol  quality  density
0            7.4               0.7          0.0             1.9      0.076                   11                    34       0.56      9.4        5      NaN
1            7.8              0.88          0.0             2.6      0.098                   25                    67       0.68      9.8        5      3.2
2            7.8              0.76         0.04             2.3      0.092                   15                    54       0.65      9.8        5      NaN
3           11.2              0.28         0.56             1.9      0.075                   17                    60       0.58      9.8        6      3.1
4            7.4               0.7          0.0             1.9      0.076                   11                    34       0.56      9.4        5      3.6
5            7.4              0.66          0.0             1.8      0.075                   13                    40       0.56      9.4        5      NaN
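
The comment in the cell above mentions three sources for a new column: another column, a list, or a default value. A sketch of each on a small invented DataFrame (`wine` and its values are assumptions for illustration). Note that a list must match the number of rows, whereas a Series is aligned on the index and leaves NaN where labels are missing.

```python
import pandas as pd

wine = pd.DataFrame({'alcohol': [9.4, 9.8, 9.8]})

wine['quality'] = [5, 5, 6]                  # from a list (length must equal the row count)
wine['source'] = 'red'                       # from a default value, repeated for every row
wine['alcohol_x2'] = wine['alcohol'] * 2     # derived from an existing column
wine['density'] = pd.Series({1: 3.2})        # from a Series: aligned on the index

print(wine)
```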


In [36]:
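#Drop rows containing any NaN value (the default, axis = 'rows'); returns a new DataFrame
# df.dropna()
#Same as above with the axis spelled out
# df.dropna(axis = 'rows')
#Drop columns containing any NaN value, modifying df in place
df.dropna(axis = 'columns', inplace = True)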
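
The effect of the `axis` argument is easier to see on a tiny example. The sketch below uses invented data with a single NaN; neither call modifies the original because `inplace` is left at its default.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, np.nan, 6.0]})

rows_dropped = frame.dropna()                 # default axis: drop rows containing NaN
cols_dropped = frame.dropna(axis='columns')   # drop columns containing NaN

print(rows_dropped)
print(cols_dropped)
print(frame.shape)   # the original is unchanged
```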


In [37]:
df

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  sulphates  alcohol  quality
0            7.4               0.7          0.0             1.9      0.076                   11                    34       0.56      9.4        5
1            7.8              0.88          0.0             2.6      0.098                   25                    67       0.68      9.8        5
2            7.8              0.76         0.04             2.3      0.092                   15                    54       0.65      9.8        5
3           11.2              0.28         0.56             1.9      0.075                   17                    60       0.58      9.8        6
4            7.4               0.7          0.0             1.9      0.076                   11                    34       0.56      9.4        5
5            7.4              0.66          0.0             1.8      0.075                   13                    40       0.56      9.4        5
