## Pandas Tutorial

--> Pandas is an open-source library built on top of the NumPy library.

--> It is a Python package that offers various data structures and operations for manipulating numerical data and time series.

--> It is popular mainly because it makes importing and analyzing data much easier.

--> Pandas is fast and offers high performance and productivity for users.

## Introduction


--> Pandas was initially developed by Wes McKinney in 2008 while he was working at AQR Capital Management. 

                                                            Advantages
                             
--> Fast and efficient for manipulating and analyzing data.

--> Data from different file objects can be loaded.

--> Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data.

--> Size mutability: columns can be inserted into and deleted from DataFrames and higher-dimensional objects.

--> Data set merging and joining.

--> Flexible reshaping and pivoting of data sets.

--> Provides time-series functionality.

--> Powerful group by functionality for performing split-apply-combine operations on data sets.
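
--> As a minimal sketch of the split-apply-combine idea, the example below groups a small table by one column and averages another. The column names and values are invented for illustration and do not come from any file used in this tutorial.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration
df = pd.DataFrame({
    "branch": ["CSE", "ECE", "CSE", "ECE", "CSE"],
    "marks": [80, 70, 90, 60, 70],
})

# Split the rows by branch, apply mean() to each group,
# and combine the results into one Series indexed by branch
mean_marks = df.groupby("branch")["marks"].mean()
print(mean_marks)
```

The result is a Series whose index holds the group labels, so individual group values can be looked up by label.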

                                                         Why Pandas is used for Data Science
                                                         
--> Pandas is generally used for data science but have you wondered why? 

--> This is because pandas is used in conjunction with other libraries that are used for data science.

--> It is built on top of the NumPy library, which means that many NumPy structures are used or replicated in Pandas.

--> The data produced by Pandas is often used as input for plotting functions in $Matplotlib$, statistical analysis in $SciPy$, and machine learning algorithms in $Scikit-learn$.

--> A Pandas program can be run from any text editor, but Jupyter Notebook is recommended because it can execute the code in a single cell rather than the entire file.

--> Jupyter also provides an easy way to visualize pandas DataFrames and plots.

                                                                   Getting Started
                                           
--> After pandas has been installed on the system, you need to import the library.

--> This module is generally imported as:

                           import pandas as pd
                           
--> Pandas provides two main data structures for manipulating data:

        1. Series
         
        2. DataFrame                           
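
--> To show how the two structures relate, here is a small sketch that builds one of each. The column names and values are made up for illustration. Selecting a single column of a DataFrame gives back a Series.

```python
import pandas as pd

# A Series is one-dimensional: one labeled column of values
ser = pd.Series([10, 20, 30], name="marks")

# A DataFrame is two-dimensional: several columns sharing one index
df = pd.DataFrame({"marks": [10, 20, 30], "grade": ["C", "B", "A"]})

# Selecting one column of a DataFrame returns a Series
column = df["marks"]

print(type(ser), ser.ndim)
print(type(df), df.ndim, df.shape)
print(type(column))
```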

## 1. Pandas Series

--> A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.).

--> The axis labels are collectively called the index. A Pandas Series is comparable to a single column in an Excel sheet.

--> Labels need not be unique but must be a hashable type. 

--> The object supports both integer and label-based indexing and provides a host of methods for performing operations involving the index.
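
--> The points about labels can be sketched with a tiny Series whose index contains a repeated label. The values are arbitrary. The `.loc` and `.iloc` indexers used here are covered in more detail later in this tutorial.

```python
import pandas as pd

# Labels may repeat, but each label must be hashable
# (for example strings, numbers, or tuples)
ser = pd.Series([1, 2, 3], index=["a", "b", "a"])

# Label-based lookup of a repeated label returns every matching entry
dup = ser.loc["a"]
print(dup)

# Position-based lookup always returns exactly one element
second = ser.iloc[1]
print(second)

# The index reports whether its labels are unique
print(ser.index.is_unique)
```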

                                                        Creating a Pandas Series
                                                        
                                                        
--> In the real world, a Pandas Series is usually created by loading a dataset from existing storage, such as a SQL database, a CSV file, or an Excel file.

--> A Pandas Series can also be created from a list, a dictionary, a scalar value, and so on.

--> Here are some of the ways to create a Series:

                Creating a series from array: 

--> To create a Series from an array, import the numpy module and use its array() function.

In [None]:
# importing pandas library
import pandas as pd

# creating an empty pandas Series
# (an explicit dtype avoids a warning in some pandas versions)
series = pd.Series(dtype=object)
print('Empty series :', series)

# importing numpy library
import numpy as np

# simple array
data = np.array(['p','a','n','d','a','s'])

ser = pd.Series(data)
print('Pandas 1st series\n', ser)

              Creating a series from array with index :-
              
--> To create a Series from an array with an index, provide an index with the same number of elements as the array.

In [None]:
# simple array
data = np.array(['A','m','i','y','a'])
  
# providing an index
series = pd.Series(data, index =[10, 11, 43, 13,56])
print(series)

            Creating a Pandas Series from Lists:-
            
            
--> To create a Series from a list, first create the list and then pass it to Series().

    Method #1 :-

--> Using Series() method without the 'index' argument.

In [None]:
# create Pandas Series with default index values
# default index ranges from 0 to len(list) - 1
x = pd.Series(['Ankit','Arka','Ayas','Lala'])
  
# print the Series
print(x)

 
    Method #2 :-

--> Using Series() method with 'index' argument.

In [None]:
# create a Pandas Series with a user-defined index
x = pd.Series([10, 20, 30, 40, 50], index =['a', 'b', 'c', 'd', 'e'])
  
# print the Series
print(x)

In [None]:
# Another example

ind = [28, 17, 21, 57, 59]

lst = ['Lala','Ankit','Arka','Dipanshu','Ganesh']

# create a Pandas Series with a user-defined index
x = pd.Series(lst, index = ind)
  
# print the Series
print(x)

    Method #3:-
    
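--> Using Series() method with a nested list (a list of lists)

In [None]:
# nested list: each inner list holds one name
# (named 'colleges' so the built-in 'list' is not shadowed)
colleges = [['Trident'], ['CV Raman'], ['Silicon'], ['ITER'],
            ['KIIT'], ['NIT'], ['IIT'], ['GIET']]

# create a Pandas Series from the first item of each inner list
ser = pd.Series([item[0] for item in colleges])

print(ser)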

                              Creating a series from Dictionary:-

--> To create a Series from a dictionary, first create the dictionary and then pass it to Series().

--> The dictionary keys are used to construct the index.

     Using Series() method without index parameter:-
     
--> In this case, the index follows the order of the dictionary keys. Recent pandas versions (0.23 and later, on Python 3.6+) keep the keys in insertion order; older versions sorted them.

    Code #1 :-

--> Dictionary keys are given in sorted order.

In [None]:
# create a dictionary
dictionary = {'A' : 10, 'B' : 50, 'C' : 30, 'D' : 20}
  
# create a series
series = pd.Series(dictionary)
  
print(series)

    Code #2 :-

--> Dictionary keys are given in unsorted order.

In [None]:
# create a dictionary
dictionary = {'D' : 10, 'B' : 20, 'C' : 30}
  
# create a series
series = pd.Series(dictionary)
  
print(series)

                   Using Series() method with index parameter:-
                   
--> In this case, the values in data corresponding to the labels in the index will be assigned.

     Code #1 : -

--> The index list passed has the same length as the number of keys in the dictionary.

In [None]:
# create a dictionary
dictionary = {'A' : 50, 'B' : 10, 'C' : 80}
  
# create a series
series = pd.Series(dictionary, index =['B', 'C', 'A'])
  
print(series)

      Code #2 :-

--> The index list passed is longer than the number of keys in the dictionary. In this case, the index order is preserved and the missing element is filled with NaN (Not a Number).

In [None]:
  
# create a dictionary
dictionary = {'A' : 50, 'B' : 10, 'C' : 80}
  
# create a series
series = pd.Series(dictionary, index =['B', 'C', 'D', 'A'])
  
print(series)

                                  Creating a series from Scalar value:-

--> In order to create a series from scalar value, an index must be provided. 

--> The scalar value will be repeated to match the length of index.

In [None]:
# giving a scalar value with index
ser = pd.Series(10, index =[0, 1, 2, 3, 4, 5])
  
print(ser)

                                    Creating a series using NumPy functions :-
                                    
                                    
--> To create a Series using NumPy functions, we can use functions such as

$numpy.linspace()$,

$numpy.random.randn()$.

     Code #1:-

--> Using $numpy.linspace()$

In [None]:
# series with numpy linspace() 
ser1 = pd.Series(np.linspace(3, 33, 3))
print(ser1)
  
# series with numpy linspace()
ser2 = pd.Series(np.linspace(1, 100, 7))
print("\n", ser2)
  

      Code #2:-

--> Using the $np.random.normal()$ and $np.random.rand()$ methods.



In [None]:
# series with numpy random.normal
ser3 = pd.Series(np.random.normal())
print(ser3)
  
# series with numpy random.normal
ser4 = pd.Series(np.random.normal(0.0, 1.0, 5))
print("\n", ser4)
  
# series with numpy random.rand
ser5 = pd.Series(np.random.rand(10))
print("\n", ser5)

     Code #3:-

--> Using $numpy.repeat()$

In [None]:
# series with numpy repeat
ser = pd.Series(np.repeat(0.08, 7))
print("\n", ser)

                                                          Accessing element of Series
                                                          


--> There are two ways to access an element of a Series:

                 Accessing Element from Series with Position
                 
                 Accessing Element Using Label (index)
                 
                 
$Accessing$  $Element$  $from$  $Series$  $with$  $Position$  $:-$

--> To access a Series element by position, refer to its index number. Use the index operator [ ] to access an element in a Series.

--> The index must be an integer. To access multiple elements from a Series, we use the slice operation.

--> The slice operation is performed on a Series with the colon (:).

--> To print elements from the beginning up to a position, use $[:Index]$; to print the last elements counting from the end, use $[-Index:]$; to print elements from a specific index to the end, use $[Index:]$; to print elements within a range, use $[Start Index:End Index]$; and to print the whole Series, use $[:]$. Further, to print the whole Series in reverse order, use $[::-1]$.
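
--> As a quick sketch of what each slice form returns, the example below applies them to a six-letter Series with the default integer index. The variable names are only illustrative.

```python
import pandas as pd

ser = pd.Series(list("pandas"))

first_three = ser[:3]         # from the beginning: p, a, n
last_two = ser[-2:]           # counted from the end: a, s
all_but_last_two = ser[:-2]   # everything except the last two: p, a, n, d
from_two = ser[2:]            # from position 2 to the end: n, d, a, s
middle = ser[1:4]             # positions 1 to 3 (4 is excluded): a, n, d
whole = ser[:]                # the whole Series
reversed_ser = ser[::-1]      # the whole Series in reverse order

print("".join(first_three), "".join(last_two), "".join(reversed_ser))
```

Note that a slice keeps the original index labels, so the reversed Series still carries the labels 5 down to 0.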
 
     Code #1:-
     
--> Accessing the first element, the first five elements, and the last ten elements of a Series

In [None]:
# creating simple array
data = np.array(['p','a','n','d','a','s','l','i','b','r','a','r','y'])
ser = pd.Series(data)


# retrieve the first five elements
print(ser[:5])

# retrieve the first element
print(ser[0])

# retrieve the last 10 elements of the Series
print(ser[-10:])

    Code #2:-
    
--> Accessing the first 5 elements of a Series read from the $pd1.csv$ file

In [None]:
# making data frame  
df = pd.read_csv(r"C:\Users\ASUS\Documents\Jupyter lab Practice\pd1.csv")  
    
# the 'Name' column is already a Series
ser = df['Name']

# first 5 elements (a cell displays only its last expression)
ser[:5]

$Accessing$  $Element$  $Using$  $Label$  $(index)$ $:-$

--> To access an element of a Series by label, refer to its index label inside the index operator [ ].

--> A Series is like a fixed-size dictionary in that you can get and set values by index label.
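
--> The dictionary analogy can be sketched as follows. The labels and values are arbitrary. `Series.get()` works like `dict.get()` and returns a default instead of raising an error when the label is missing.

```python
import pandas as pd

ser = pd.Series([3, 4, 5], index=["a", "b", "c"])

# get a value by label, like a dictionary lookup
value_b = ser["b"]

# set a value by label
ser["c"] = 50

# membership tests look at the index labels, not the values
has_a = "a" in ser

# get() returns the default when the label does not exist
missing = ser.get("z", -1)

print(value_b, ser["c"], has_a, missing)
```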

     Code #1:-
     
--> Accessing a single element using index label

In [None]:
data = np.array(['p','a','n','d','a','s','l','i','b','r','a','r','y'])
ser = pd.Series(data, index =[10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22])
   
   
# accessing an element using its index label
print(ser[16])

 
     Code #2:-
     
--> Accessing multiple elements using index labels

In [None]:
# accessing multiple elements using
# a list of index labels
print(ser[[10, 11, 12, 20, 14]])

In [None]:
    
ser = pd.Series(np.arange(3, 9), index =['a', 'b', 'c', 'd', 'e', 'f']) 
    
print(ser[['a', 'd']])

     Code #3:-
     
--> Accessing multiple elements using index labels in the pd1.csv file

In [None]:
# making data frame  
df = pd.read_csv(r"C:\Users\ASUS\Documents\Jupyter lab Practice\pd1.csv")  
    
# the 'Name' column is already a Series
ser = df['Name']

# elements with index labels 0, 3, 6 and 9
ser[[0, 3, 6, 9]]

                                                        Indexing and Selecting Data in Series

--> Indexing in pandas simply means selecting particular data from a Series.

--> Indexing could mean selecting all of the data or only some of it.

--> Indexing is also known as Subset Selection.

    Indexing a Series using indexing operator [] :
    
--> The indexing operator refers to the square brackets that follow an object.

--> The $.loc$ and $.iloc$ indexers also use the indexing operator to make selections.

--> In this section, the indexing operator refers to df[ ].



$Indexing$  $a$  $Series$  $using$  $.loc[ ]$  $:-$

--> $.loc[ ]$ selects data by referring to the explicit index labels.

--> The $df.loc$ indexer selects data differently from the plain indexing operator: it can select subsets of rows and columns by label.

           Syntax:- pandas.DataFrame.loc[]
           
           Parameters:-
           
$Index$ $label$ $:-$ String or list of strings giving the index labels of the rows

$Return$  $type$  $:-$ DataFrame or Series, depending on the parameters
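
--> The examples in this section read a local CSV file. For readers without that file, here is a self-contained sketch of the same idea. The names, columns, and values below are hypothetical stand-ins and are not taken from pd1.csv.

```python
import pandas as pd

# Hypothetical stand-in data; the names are used as the index
df = pd.DataFrame(
    {"Branch": ["CSE", "ECE", "CSE"], "Roll No": [17, 38, 28]},
    index=["Asha", "Bikram", "Chitra"],
)

# a single label returns that row as a Series
one_row = df.loc["Bikram"]

# a list of labels returns a DataFrame
two_rows = df.loc[["Asha", "Chitra"]]

# a row label and a column label together return one value
cell = df.loc["Chitra", "Roll No"]

print(type(one_row), type(two_rows), cell)
```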

     Example #1: Extracting single Row

--> In this example, the Name column is made the index column, and then two single rows are extracted one by one as Series using their index labels.

In [None]:
  
# making data frame from csv file
data = pd.read_csv(r"C:\Users\ASUS\Documents\Jupyter lab Practice\pd1.csv", index_col ="Name")
  
# retrieving row by loc method
first = data.loc["Amitab"]
second = data.loc["Ayusman"]
  
  
print(first, "\n\n\n", second)

      Example #2: Multiple parameters

--> In this example, the Name column is made the index column, and then two rows are extracted at the same time by passing a list of labels.

In [None]:
# retrieving rows by loc method
rows = data.loc[[ "Dipanshu","Arkasarathi"]] 
# checking data type of rows
print(type(rows))
  
# display
rows

       Example #3: Extracting multiple rows with same index

--> In this example, the Branch column is made the index column, and one branch name is passed to $.loc$ to check whether all rows with that branch name are returned.

In [None]:
df

In [None]:
# making data frame from csv file
data = pd.read_csv(r"C:\Users\ASUS\Documents\Jupyter lab Practice\pd1.csv", index_col ="Branch")
  
# retrieving rows by loc method
rows = data.loc["CSE"]
  
# checking data type of rows
print(type(rows))
  
# display
rows

      Example #4: Extracting rows between two index labels

--> In this example, two row index labels are passed, and all the rows that fall between those two labels are returned (both labels inclusive).

In [None]:
  
# making data frame from csv file
data = pd.read_csv(r"C:\Users\ASUS\Documents\Jupyter lab Practice\pd1.csv", index_col ="Name")
rows = data.loc["Ankit":"Amitab"]
  
# checking data type of rows
print(type(rows))
  
# display
rows  

$Indexing$ $a$ $Series$ $using$ $.iloc[ ]$ $:-$

--> This function allows us to retrieve data by position.

--> In order to do that, we’ll need to specify the positions of the data that we want.

--> The $df.iloc$ indexer is very similar to $df.loc$ but uses only integer locations to make its selections.

     Syntax: pandas.DataFrame.iloc[]

    Parameters:
    
$Index$ $Position$ $:-$ Index position of rows as an integer or a list of integers.

$Return$  $type$ $:-$ DataFrame or Series, depending on the parameters
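
--> As a self-contained sketch that does not need the CSV file, the example below selects by position from a small DataFrame. The data and the index labels are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in data with non-numeric index labels
df = pd.DataFrame(
    {"Branch": ["CSE", "ECE", "CSE", "EEE"], "Roll No": [17, 38, 28, 45]},
    index=["w", "x", "y", "z"],
)

# a single integer returns the row at that position as a Series
row = df.iloc[2]

# a list of integers returns the rows at those positions
picked = df.iloc[[0, 3]]

# a slice returns positions 1 and 2; the end position 3 is excluded
block = df.iloc[1:3]

# a row position and a column position together return one value
cell = df.iloc[0, 1]

print(row.name, list(picked.index), list(block.index), cell)
```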

Example #1: Extracting single row and comparing with .loc[]

In this example, the row with the same index number is extracted with both $.iloc[]$ and $.loc[]$ and the results are compared.

Since the default index is numeric, the index labels are also integers.

In [None]:
data = pd.read_csv(r"C:\Users\ASUS\Documents\Jupyter lab Practice\pd1.csv" ) 
# retrieving rows by loc method 
row1 = data.loc[3]
  
# retrieving rows by iloc method
row2 = data.iloc[3]
  
# checking if values are equal
row1 == row2

      Example #2: Extracting multiple rows with index

In this example, multiple rows are extracted first by passing a list and then by passing integers to extract rows between that range.

After that, both the values are compared.

In [None]:
# retrieving rows by iloc method using a list of positions
row1 = data.iloc[[4, 5, 6, 7]]
  
# retrieving rows by iloc method using a slice of positions
row2 = data.iloc[4:8]
  
# comparing values
row1 == row2

     Difference between $.loc[]$ and $.iloc[]$ :-

--> $.loc[]$ is a label-based selection method, which means that we have to pass the name of the row or column that we want to select.

--> When given a slice, $.loc[]$ includes the last element of the range, unlike $.iloc[]$.

--> $.loc[]$ accepts a boolean Series as a mask, whereas $.iloc[]$ accepts only a plain boolean array or list.
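
--> The differences can be seen side by side in a short sketch. The Series below is arbitrary. The boolean mask is converted with `to_numpy()` before it is handed to `.iloc[]`, since `.iloc[]` does not take a boolean Series directly.

```python
import pandas as pd

ser = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# label slice: both end labels are included
by_label = ser.loc["a":"c"]

# position slice: the end position is excluded
by_position = ser.iloc[0:2]

# boolean selection: .loc takes the boolean Series as it is
mask = ser > 15
masked = ser.loc[mask]

# .iloc needs a plain boolean array instead
masked_iloc = ser.iloc[mask.to_numpy()]

print(list(by_label), list(by_position), list(masked))
```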


                                                          Binary Operation on Series
                                                          
--> We can perform binary operations on Series, such as addition, subtraction, and many others.

--> To perform binary operations on Series, we use methods such as $.add()$, $.sub()$, etc.

In [None]:
# creating a series
data = pd.Series([5, 2, 3,7], index=['a', 'b', 'c', 'd'])
 
# creating a series
data1 = pd.Series([1, 6, 4, 9], index=['a', 'b', 'd', 'e'])
 
print(data, "\n\n", data1)

# Now we add the two series using the .add() method.
# fill_value=0 stands in for a label that is missing from one of the series.
print(data.add(data1, fill_value=0))

# Now we subtract the two series using the .sub() method.
print(data.sub(data1, fill_value=0))
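
--> To see what `fill_value` changes, the sketch below reuses the two Series from the cell above and compares the plain `+` operator with `.add()`. Labels that exist in only one Series give NaN with `+`, while `.add(..., fill_value=0)` treats the missing side as 0.

```python
import pandas as pd

data = pd.Series([5, 2, 3, 7], index=["a", "b", "c", "d"])
data1 = pd.Series([1, 6, 4, 9], index=["a", "b", "d", "e"])

# plain operator: 'c' and 'e' exist in only one Series, so they become NaN
plain = data + data1

# .add() with fill_value: the missing side is treated as 0
filled = data.add(data1, fill_value=0)

print(plain)
print(filled)
```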

                                               Binary operation methods on series:

| Function | Description |
|----------|-------------|
| add() | Adds a Series or list-like object of the same length to the caller Series |
| sub() | Subtracts a Series or list-like object of the same length from the caller Series |
| mul() | Multiplies the caller Series by a Series or list-like object of the same length |
| div() | Divides the caller Series by a Series or list-like object of the same length |
| sum() | Returns the sum of the values for the requested axis |
| prod() | Returns the product of the values for the requested axis |
| mean() | Returns the mean of the values for the requested axis |
| pow() | Raises each element of the caller Series to the power of the corresponding element of the passed Series and returns the result |
| abs() | Returns the absolute numeric value of each element in a Series/DataFrame |
| cov() | Computes the covariance of two Series |
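
--> The direction of the non-commutative methods is easy to mix up, so here is a small sketch with arbitrary numbers. In each call the caller Series is the left-hand operand.

```python
import pandas as pd

a = pd.Series([8, 9, 10])
b = pd.Series([2, 3, 5])

# a.div(b) computes a / b, element by element
quotient = a.div(b)

# a.pow(b) computes a ** b, element by element
power = a.pow(b)

# a.mul(b) computes a * b, element by element
product = a.mul(b)

print(list(quotient), list(power), list(product))
```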

                                                      Conversion Operation on Series
                                                     
--> Conversion operations include changing the data type of a Series, converting a Series to a list, and so on.

--> To perform conversion operations, we use methods such as $.astype()$, $.tolist()$, etc.
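
--> Before the CSV-based cells, here is a self-contained sketch of both conversions on a small Series of numeric strings. The values are arbitrary.

```python
import pandas as pd

# numbers stored as strings
ser = pd.Series(["17", "38", "28"])

# astype() returns a new Series with the requested data type
as_int = ser.astype(int)

# tolist() returns a plain Python list
as_list = as_int.tolist()

print(as_int.dtype)
print(type(as_list), as_list)
```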

In [None]:
# Python program using astype
# to convert a datatype of series
 
# importing pandas module  
import pandas as pd 
   
# reading csv file from a local path  
data = pd.read_csv(r"C:\Users\ASUS\Documents\Jupyter lab Practice\pd1.csv" ) 
    
# dropping rows that contain null values to avoid errors 
data.dropna(inplace = True) 
   
# storing dtype before converting 
before = data.dtypes 
   
# converting dtypes using astype 
data["Roll No"]= data["Roll No"].astype(int) 
data["Regd.No"]= data["Regd.No"].astype(str) 
   
# storing dtype after converting 
after = data.dtypes 
   
# printing to compare 
print("BEFORE CONVERSION\n", before, "\n") 
print("AFTER CONVERSION\n", after, "\n") 


In [43]:
     
# removing rows with null values to avoid errors  
data.dropna(inplace = True)  
   
# storing the object type before the operation 
type_before = type(data["Roll No"]) 
   
# converting to list 
roll_no_list = data["Roll No"].tolist() 
   
# storing the object type after the operation 
type_after = type(roll_no_list) 
   
# printing both types 
print("Data type before converting = {}\nData type after converting = {}"
      .format(type_before, type_after)) 
   
# displaying list 
roll_no_list 

Data type before converting = <class 'pandas.core.series.Series'>
Data type after converting = <class 'list'>


[17,
 38,
 28,
 45,
 34,
 29,
 57,
 39,
 31,
 55,
 21,
 22,
 42,
 54,
 15,
 13,
 11,
 75,
 59,
 67,
 30]

 
                                                          Pandas series method:
 
| Method / attribute | Description |
|--------------------|-------------|
| Series() | Constructor that creates a pandas Series. It accepts a variety of inputs |
| combine_first() | Combines two Series into one, filling missing values in the caller with values from the other Series |
| count() | Returns the number of non-NA/null observations in the Series |
| size | Attribute that returns the number of elements in the underlying data |
| name | Attribute that gets or sets the name of a Series object, i.e. the column name |
| is_unique | Attribute that is True if the values in the object are unique |
| idxmax() | Returns the index label of the highest value in a Series |
| idxmin() | Returns the index label of the lowest value in a Series |
| sort_values() | Sorts a Series by its values in ascending or descending order |
| sort_index() | Sorts a Series by its index instead of its values |
| head() | Returns a specified number of rows from the beginning of a Series as a new Series |
| tail() | Returns a specified number of rows from the end of a Series as a new Series |
| le() | Compares every element of the caller Series with the passed Series. Returns True for every element that is less than or equal to the corresponding element |
| ne() | Compares every element of the caller Series with the passed Series. Returns True for every element that is not equal to the corresponding element |
| ge() | Compares every element of the caller Series with the passed Series. Returns True for every element that is greater than or equal to the corresponding element |
| eq() | Compares every element of the caller Series with the passed Series. Returns True for every element that is equal to the corresponding element |
| gt() | Compares every element of the caller Series with the passed Series. Returns True for every element that is greater than the corresponding element |
| lt() | Compares every element of the caller Series with the passed Series. Returns True for every element that is less than the corresponding element |
| clip() | Clips values below and above the passed lower and upper limits |
| clip_lower() | Clips values below a passed lower limit (removed in pandas 1.0; use clip(lower=...) instead) |
| clip_upper() | Clips values above a passed upper limit (removed in pandas 1.0; use clip(upper=...) instead) |
| astype() | Changes the data type of a Series |
| tolist() | Converts a Series to a list |
| get() | Extracts a value from a Series by label. An alternative to the bracket syntax that can return a default when the label is missing |
| unique() | Returns the unique values in a Series |
| nunique() | Returns the count of unique values |
| value_counts() | Counts the number of times each unique value occurs in a Series |
| factorize() | Returns a numeric representation of an array by identifying its distinct values |
| map() | Maps each value of a Series to another value using a dictionary, Series, or function |
| between() | Checks which values lie between the first and second argument |
| apply() | Takes a Python function as an argument and applies it to every value of the Series. Helpful for custom operations that are not included in pandas or numpy |
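
--> A few of the methods above can be tried on a small Series. This is only a sketch with arbitrary values, and the mapping dictionary is invented for illustration.

```python
import pandas as pd

ser = pd.Series([3, 1, 3, 7, 1, 3])

# how often each distinct value occurs
counts = ser.value_counts()

# True where 2 <= value <= 5 (both ends inclusive by default)
in_range = ser.between(2, 5)

# values below 2 become 2, values above 5 become 5
clipped = ser.clip(lower=2, upper=5)

# replace each value using a dictionary lookup
labels = ser.map({1: "low", 3: "mid", 7: "high"})

# run a custom Python function on every value
squared = ser.apply(lambda v: v * v)

print(counts)
print(list(in_range))
print(list(clipped), list(labels), list(squared))
print(ser.nunique(), ser.idxmax())
```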
