<a href="https://colab.research.google.com/github/anujsaxena/AIML/blob/main/AIML_LAB_3_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Pandas**

Pandas is an open-source Python library that provides high-performance tools for data manipulation and analysis, built around its own data structures (Series and DataFrame). The name comes from "panel data", an econometrics term for multidimensional data.

Developer Wes McKinney started creating pandas in 2008, when he needed a high-performance, versatile tool for data analysis.
Before pandas, Python was used mostly for data munging (converting data from one format to another) and preparation, and contributed little to the analysis itself. Pandas filled that gap. Regardless of the data's origin, pandas supports five common steps in data processing and analysis:

1. load
2. prepare
3. manipulate
4. model
5. analyze

# **Key Features of Pandas**
1. Fast and efficient DataFrame object with default and customized indexing.
2. Tools for loading data into in-memory data objects from different file formats.
3. Data alignment and integrated handling of missing data.
4. Reshaping and pivoting of data sets.
5. Label-based slicing, indexing and subsetting of large data sets.
6. Columns from a data structure can be deleted or inserted.
7. Group by data for aggregation and transformations.
8. High performance merging and joining of data.
9. Time Series functionality.
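
As a rough illustration of features 7 and 8 in the list above, the sketch below groups and merges two small tables. The data and the names `sales`, `regions`, `units_per_city` and `merged` are made up for this example.

```python
import pandas as pd

# Made-up sales records, used only for illustration.
sales = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Delhi"],
    "units": [10, 4, 6, 9],
})
regions = pd.DataFrame({
    "city": ["Pune", "Delhi"],
    "region": ["West", "North"],
})

# Feature 7: group rows by a key and aggregate each group.
units_per_city = sales.groupby("city")["units"].sum()

# Feature 8: join two tables on a shared column.
merged = sales.merge(regions, on="city")

print(units_per_city)
print(merged)
```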

# **Series**

A Series is a one-dimensional labeled array. It can be created using the following constructor:

**pandas.Series(data, index, dtype, copy)**

Description of parameters

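**data**

The values to store. Accepts several forms, such as an ndarray, a list, a dictionary, or a scalar constant.

**index**

The row labels. Values must be hashable and, for array-like data, the index must have the same length as data; labels do not have to be unique. Defaults to np.arange(n) if no index is passed.

**dtype**

The data type. If None, the data type is inferred from data.

**copy**

Whether to copy the input data. Defaults to False.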

A Series can be created empty, or from any of the following inputs:
1. Array
2. Dictionary
3. Scalar Values or constants
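
The empty case mentioned above is not shown in the cells that follow, so here is a minimal sketch of it. Passing `dtype` explicitly is a choice made for this example, and the variable name `empty` is only illustrative.

```python
import pandas as pd

# An empty Series; dtype is given explicitly so pandas does not have to guess.
empty = pd.Series(dtype="float64")

print(empty)
print(empty.empty)  # True
print(empty.size)   # 0
```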


In [86]:
import numpy as np
import pandas as pd
data = np.array(['a','e','i','o','u'])
print(data)
print(type(data))
s= pd.Series(data)
print(s)
print(type(s))

['a' 'e' 'i' 'o' 'u']
<class 'numpy.ndarray'>
0    a
1    e
2    i
3    o
4    u
dtype: object
<class 'pandas.core.series.Series'>


In [87]:
s= pd.Series(data, index=[100,101,102, 103, 104])
print(s)
print(type(s))

100    a
101    e
102    i
103    o
104    u
dtype: object
<class 'pandas.core.series.Series'>


In [88]:
a = np.arange(100,105)
s= pd.Series(data, a)
print(s)
print(type(s))

100    a
101    e
102    i
103    o
104    u
dtype: object
<class 'pandas.core.series.Series'>


# **From Dictionary**

In [4]:
data = {'a':0,'b':4,'c':8}
print(data)
print(type(data))

{'a': 0, 'b': 4, 'c': 8}
<class 'dict'>


In [6]:
s = pd.Series(data)
print(s)

a    0
b    4
c    8
dtype: int64


In [7]:
s = pd.Series(data,index=['b','c','a'])
print(s)

b    4
c    8
a    0
dtype: int64


In [8]:
s = pd.Series(data,index=['b','c','d','a'])
print(s)

b    4.0
c    8.0
d    NaN
a    0.0
dtype: float64
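
In the last cell, the label 'd' has no matching dictionary key, so pandas stores NaN there and the dtype changes from int64 to float64. The sketch below shows two common ways to deal with such a gap. The fill value -1 and the names `cleaned` and `filled` are assumptions chosen for illustration.

```python
import pandas as pd

data = {'a': 0, 'b': 4, 'c': 8}
s = pd.Series(data, index=['b', 'c', 'd', 'a'])

# 'd' is not a key of data, so its value is NaN and the dtype is float64.
print(s.isna())

# Option 1: drop the missing entry and convert back to integers.
cleaned = s.dropna().astype('int64')
print(cleaned)

# Option 2: replace the missing entry with a default value.
filled = s.fillna(-1).astype('int64')
print(filled)
```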


# **Scalar**

In [9]:
s = pd.Series(5, index=[1,2,3,4,5])
print(s)

1    5
2    5
3    5
4    5
5    5
dtype: int64


In [10]:
s = pd.Series('k', index=[1,2,3,4,5])
print(s)

1    k
2    k
3    k
4    k
5    k
dtype: object


# **Access Series with position**

In [11]:
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print(s)
#retrieve the first element by position (s[0] on a label index is deprecated in recent pandas)
print(s.iloc[0])


a    1
b    2
c    3
d    4
e    5
dtype: int64
1


In [12]:
print(s.iloc[3])

4


In [14]:
#retrieve several elements by position using slices
print('s[0:]',s[0:])
print('s[2:]',s[2:])
print('s[:1]',s[:1])
print('s[:3]',s[:3])
print('s[-3:]',s[-3:])
print('s[:-3]',s[:-3])


s[0:] a    1
b    2
c    3
d    4
e    5
dtype: int64
s[2:] c    3
d    4
e    5
dtype: int64
s[:1] a    1
dtype: int64
s[:3] a    1
b    2
c    3
dtype: int64
s[-3:] c    3
d    4
e    5
dtype: int64
s[:-3] a    1
b    2
dtype: int64
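
The slices above are positional. As a small sketch of how this differs from slicing by label, the example below uses `iloc` and `loc` explicitly. The names `by_position` and `by_label` are only illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# Position-based: the stop position is excluded, as with Python lists.
by_position = s.iloc[1:3]
print(by_position)   # rows 'b' and 'c'

# Label-based: both endpoints are included.
by_label = s.loc['b':'d']
print(by_label)      # rows 'b', 'c' and 'd'
```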


# **Retrieve data using index labels**

In [15]:
print(s)

a    1
b    2
c    3
d    4
e    5
dtype: int64


In [16]:
print(s['b'])

2
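
Label access also works with several labels at once, and a missing label needs some care. The following is a minimal sketch; the label 'z' and the default value 0 are assumptions chosen for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# A list of labels returns a new Series with those entries.
subset = s[['a', 'c', 'e']]
print(subset)

# Asking for a missing label with [] raises KeyError ...
try:
    s['z']
except KeyError:
    print("label 'z' not found")

# ... while get() returns a default value instead.
value = s.get('z', default=0)
print(value)
```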


In [17]:
s = pd.Series(np.random.randn(4))
print(s)
print ("Is the Object empty?")
print(s.empty)

0   -1.336760
1   -0.043544
2    1.166165
3    0.648623
dtype: float64
Is the Object empty?
False


In [18]:
#get the dimension
s = pd.Series(np.random.randn(10))
print(s)
print ("The dimensions of the object:")
print(s.ndim)


0   -0.883393
1   -0.825048
2   -1.060432
3   -0.228159
4   -1.516877
5   -1.027429
6   -0.505182
7   -0.735062
8   -0.483171
9    0.832217
dtype: float64
The dimensions of the object:
1


In [19]:
#size
print(s.size)

10


In [20]:
print ("The actual data series is:")
l = s.values
print(l)
print(type(l))

The actual data series is:
[-0.88339305 -0.82504764 -1.06043202 -0.22815926 -1.51687667 -1.02742859
 -0.50518227 -0.73506218 -0.48317067  0.83221693]
<class 'numpy.ndarray'>


In [21]:
print(s)

0   -0.883393
1   -0.825048
2   -1.060432
3   -0.228159
4   -1.516877
5   -1.027429
6   -0.505182
7   -0.735062
8   -0.483171
9    0.832217
dtype: float64


# **Head and Tail**

In [22]:
print ("The first two rows of the data series:")
print(s.head(2))

The first two rows of the data series:
0   -0.883393
1   -0.825048
dtype: float64


In [23]:
print(s.head())

0   -0.883393
1   -0.825048
2   -1.060432
3   -0.228159
4   -1.516877
dtype: float64


In [24]:
print(s.tail())

5   -1.027429
6   -0.505182
7   -0.735062
8   -0.483171
9    0.832217
dtype: float64


In [25]:
print(s.tail(2))

8   -0.483171
9    0.832217
dtype: float64


# **Data Frame**

In [28]:
data =[['Sameer',45],['Akshita',32],['Dhruv', 22]]
print(data)
print(type(data))

[['Sameer', 45], ['Akshita', 32], ['Dhruv', 22]]
<class 'list'>


In [29]:
df = pd.DataFrame(data, columns=['Name','Age'])
print(df)

      Name  Age
0   Sameer   45
1  Akshita   32
2    Dhruv   22


# **Create a DataFrame from Dict of ndarrays / Lists**

In [30]:
data ={'Name':['Sam','Ayushi','Varun','Vidushi'], 'Age':[22,25,21,29]}
print(data)
print(type(data))

{'Name': ['Sam', 'Ayushi', 'Varun', 'Vidushi'], 'Age': [22, 25, 21, 29]}
<class 'dict'>


In [32]:
df = pd.DataFrame(data)
print(df)
print(type(df))

      Name  Age
0      Sam   22
1   Ayushi   25
2    Varun   21
3  Vidushi   29
<class 'pandas.core.frame.DataFrame'>


In [34]:
data ={'Name':['Sam','Ayushi','Varun','Vidushi'], 'Percentage':[89,98,87,99]}
df = pd.DataFrame(data, index=['rank3', 'rank2','rank4','rank1'])
print(df)

          Name  Percentage
rank3      Sam          89
rank2   Ayushi          98
rank4    Varun          87
rank1  Vidushi          99


In [35]:
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
print(data)
print(type(data))
df = pd.DataFrame(data)
print(df)

[{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
<class 'list'>
   a   b     c
0  1   2   NaN
1  5  10  20.0


In [36]:
df = pd.DataFrame(data, index=['first', 'second'])
print(df)

        a   b     c
first   1   2   NaN
second  5  10  20.0


In [37]:
df0 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b','c'])
print(df0)
#Columns 'a' and 'b' both match dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])
#Column 'b1' matches no dictionary key, so it is filled with NaN
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print(df1)
print(df2)

        a   b     c
first   1   2   NaN
second  5  10  20.0
        a   b
first   1   2
second  5  10
        a  b1
first   1 NaN
second  5 NaN


# **Create a DataFrame from Dict of Series**

In [40]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
print(d)
df = pd.DataFrame(d)
print('-----------------')
print(df)


{'one': a    1
b    2
c    3
dtype: int64, 'two': a    1
b    2
c    3
d    4
dtype: int64}
-----------------
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4


In [41]:
print(df ['one'])

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64


# **Column Addition**

In [42]:
print(df)

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4


# **Adding a new column to an existing DataFrame object with column label by passing new series**

In [43]:
df["three"] = pd.Series([10,15,20,25], index=['a','b','c','d'])
print(df)

   one  two  three
a  1.0    1     10
b  2.0    2     15
c  3.0    3     20
d  NaN    4     25


In [44]:
df["four"] = pd.Series([10,15,20,25,30], index=['a','b','c','d','e'])
print(df)

   one  two  three  four
a  1.0    1     10    10
b  2.0    2     15    15
c  3.0    3     20    20
d  NaN    4     25    25


In [45]:
df["five"] =df["one"]+df["two"] + df["three"] + df["four"]
print(df)

   one  two  three  four  five
a  1.0    1     10    10  22.0
b  2.0    2     15    15  34.0
c  3.0    3     20    20  46.0
d  NaN    4     25    25   NaN


In [46]:
df["six"] = df['two']+df['three']+df['four']
print(df)

   one  two  three  four  five  six
a  1.0    1     10    10  22.0   21
b  2.0    2     15    15  34.0   32
c  3.0    3     20    20  46.0   43
d  NaN    4     25    25   NaN   54


# **Column Deletion**

In [47]:
print("Original dataframe")
print(df)
print ("Deleting the first column using the del statement:")
del df['one']
print("After deletion")
print(df)

Original dataframe
   one  two  three  four  five  six
a  1.0    1     10    10  22.0   21
b  2.0    2     15    15  34.0   32
c  3.0    3     20    20  46.0   43
d  NaN    4     25    25   NaN   54
Deleting the first column using the del statement:
After deletion
   two  three  four  five  six
a    1     10    10  22.0   21
b    2     15    15  34.0   32
c    3     20    20  46.0   43
d    4     25    25   NaN   54


In [48]:
df.pop('five')
print(df)

   two  three  four  six
a    1     10    10   21
b    2     15    15   32
c    3     20    20   43
d    4     25    25   54
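
Both `del` and `pop()` modify the DataFrame in place. A non-mutating alternative is `drop()`, sketched below on a small frame. The names `trimmed` and `removed` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'two': [1, 2], 'three': [10, 15], 'four': [10, 15]},
                  index=['a', 'b'])

# drop() returns a new DataFrame and leaves df unchanged.
trimmed = df.drop(columns=['three'])
print(trimmed.columns.tolist())   # ['two', 'four']
print(df.columns.tolist())        # ['two', 'three', 'four']

# pop() removes the column from df and returns it as a Series.
removed = df.pop('four')
print(removed)
print(df.columns.tolist())        # ['two', 'three']
```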


# **Selecting a row / Accessing DataFrame**
Using Label


In [49]:
d = df.loc['a']
print(d)

two       1
three    10
four     10
six      21
Name: a, dtype: int64


In [50]:
d = df.loc['c']
print(d)

two       3
three    20
four     20
six      43
Name: c, dtype: int64


# **Using Integer Location**

In [52]:
print(df)
print('----------------')
d = df.iloc[2]   #location of the row
print(d)

   two  three  four  six
a    1     10    10   21
b    2     15    15   32
c    3     20    20   43
d    4     25    25   54
----------------
two       3
three    20
four     20
six      43
Name: c, dtype: int64
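
`loc` and `iloc` also accept a column selector and can return more than one row. The following is a small sketch on a reduced version of the frame above, not an exhaustive list of what the two indexers support.

```python
import pandas as pd

df = pd.DataFrame({'two': [1, 2, 3, 4], 'three': [10, 15, 20, 25]},
                  index=['a', 'b', 'c', 'd'])

# Single cell by label: row 'c', column 'three'.
print(df.loc['c', 'three'])        # 20

# Several rows by label returns a DataFrame.
print(df.loc[['a', 'd']])

# Row positions 1 and 2, column position 0.
print(df.iloc[1:3, 0])
```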


# **Slicing**



In [53]:
d = df[2:4]
print(d)

   two  three  four  six
c    3     20    20   43
d    4     25    25   54


In [54]:
d = df[1:]
print(d)

   two  three  four  six
b    2     15    15   32
c    3     20    20   43
d    4     25    25   54


# **Row Addition**

In [55]:
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])
print(df)
print(df2)

   a  b
0  1  2
1  3  4
   a  b
0  5  6
1  7  8


In [56]:
d = pd.concat([df, df2])   #DataFrame.append() was removed in pandas 2.0
print(d)

   a  b
0  1  2
1  3  4
0  5  6
1  7  8


# **Delete Rows**

In [57]:
d = d.drop(0)
print(d)

   a  b
1  3  4
1  7  8
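
`drop(0)` removed two rows above because the label 0 occurred twice after the two frames were combined. If that is not the intent, one option is to renumber the rows while concatenating, as in this sketch. The name `combined` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])

# ignore_index=True renumbers the rows 0..n-1 instead of keeping 0, 1, 0, 1.
combined = pd.concat([df, df2], ignore_index=True)
print(combined)

# Each label is now unique, so drop(0) removes exactly one row.
print(combined.drop(0))
```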


# **Functionality**

Transpose

In [58]:
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
df = pd.DataFrame(d)
print(df)
t = df.T
print(t)

    Name  Age  Rating
0    Tom   25    4.23
1  James   26    3.24
2  Ricky   25    3.98
3    Vin   23    2.56
4  Steve   30    3.20
5  Smith   29    4.60
6   Jack   23    3.80
           0      1      2     3      4      5     6
Name     Tom  James  Ricky   Vin  Steve  Smith  Jack
Age       25     26     25    23     30     29    23
Rating  4.23   3.24   3.98  2.56    3.2    4.6   3.8


# **axes**

In [60]:
print(df)

    Name  Age  Rating
0    Tom   25    4.23
1  James   26    3.24
2  Ricky   25    3.98
3    Vin   23    2.56
4  Steve   30    3.20
5  Smith   29    4.60
6   Jack   23    3.80


In [61]:
print(df.axes)

[RangeIndex(start=0, stop=7, step=1), Index(['Name', 'Age', 'Rating'], dtype='object')]


# **dtypes**

In [62]:
print(df.dtypes)

Name       object
Age         int64
Rating    float64
dtype: object


# **empty**

In [63]:
print(df)

    Name  Age  Rating
0    Tom   25    4.23
1  James   26    3.24
2  Ricky   25    3.98
3    Vin   23    2.56
4  Steve   30    3.20
5  Smith   29    4.60
6   Jack   23    3.80


In [64]:
print(df.empty)

False


# **ndim**

In [65]:
print(df.ndim)

2


# **Shape**

In [66]:
print(df.shape)

(7, 3)


# **size**

In [67]:
print(df.size)

21


# **Values**

In [68]:
print(df.values)

[['Tom' 25 4.23]
 ['James' 26 3.24]
 ['Ricky' 25 3.98]
 ['Vin' 23 2.56]
 ['Steve' 30 3.2]
 ['Smith' 29 4.6]
 ['Jack' 23 3.8]]


Statistical functions

In [69]:
print(df.std(numeric_only=True))   #skip the non-numeric Name column

Age       2.734262
Rating    0.698628
dtype: float64


# **Functions & Description**
Let us now look at the descriptive statistics functions in pandas. The following list shows the most important ones:

1. count(): Number of non-null observations
2. sum(): Sum of values
3. mean(): Mean of values
4. median(): Median of values
5. mode(): Mode of values
6. std(): Standard deviation of the values
7. min(): Minimum value
8. max(): Maximum value
9. abs(): Absolute value
10. prod(): Product of values
11. cumsum(): Cumulative sum
12. cumprod(): Cumulative product

Note: Because a DataFrame is a heterogeneous data structure, not every function works on every column.

•	Functions like sum() and cumsum() work with both numeric and string data without raising an error. String aggregations are rarely useful in practice, but these functions do not throw an exception.

•	Functions like abs() and cumprod() throw an exception when the DataFrame contains string data, because such operations cannot be performed on strings.
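
One way to sidestep the mixed-type problem described above is to restrict a computation to the numeric columns. The sketch below shows two ways of doing that on a three-row sample; the variable name `numeric` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'James', 'Ricky'],
                   'Age': [25, 26, 25],
                   'Rating': [4.23, 3.24, 3.98]})

# Restrict the computation to numeric columns explicitly.
means = df.mean(numeric_only=True)
print(means)

# Same idea in two steps: select the numeric columns, then aggregate.
numeric = df.select_dtypes(include='number')
print(numeric.sum())
```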

In [70]:
#max()
import pandas as pd
df = pd.DataFrame({"A":[12, 4, 5, 44, 1], 
                   "B":[5, 2, 54, 3, 2], 
                   "C":[20, 16, 7, 3, 8],  
                   "D":[14, 3, 17, 2, 6]}) 
print(df)
print(df.max())

    A   B   C   D
0  12   5  20  14
1   4   2  16   3
2   5  54   7  17
3  44   3   3   2
4   1   2   8   6
A    44
B    54
C    20
D    17
dtype: int64


In [71]:
print(df.max(axis=0)) #gets max of each column

A    44
B    54
C    20
D    17
dtype: int64


In [72]:
print(df.max(axis=1)) #gets max of each row

0    20
1    16
2    54
3    44
4     8
dtype: int64


In [73]:
df = pd.DataFrame({"A":[12, 4, 5, None, 1],  
                   "B":[7, 2, 54, 3, None], 
                   "C":[20, 16, 11, 3, 8], 
                   "D":[14, 3, None, 2, 6]}) 

df.max()

A    12.0
B    54.0
C    20.0
D    14.0
dtype: float64

In [74]:
print(df.max(axis=1)) #gets max of each row

0    20.0
1    16.0
2    54.0
3     3.0
4     8.0
dtype: float64


In [75]:
df = pd.DataFrame({"A":[12, 4, 5, None, 1],  
                   "B":[7, 2, 54, None, 3], 
                   "C":[20, 16, 11, None, 8], 
                   "D":[14, 3, 2, None, 6]}) 

df.max()

A    12.0
B    54.0
C    20.0
D    14.0
dtype: float64

In [76]:
print(df.max(axis=1)) #gets max of each row

0    20.0
1    16.0
2    54.0
3     NaN
4     8.0
dtype: float64


In [77]:
# skip the NaN values while finding the maximum 
df.max(axis = 0, skipna = True)

A    12.0
B    54.0
C    20.0
D    14.0
dtype: float64

In [78]:
df.max(axis = 1, skipna = True)

0    20.0
1    16.0
2    54.0
3     NaN
4     8.0
dtype: float64
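
`skipna=True` is the default, which is why the last two cells give the same results as the two before them. To see the parameter make a difference, it has to be set to `False`, as in this sketch on a two-column sample.

```python
import pandas as pd

df = pd.DataFrame({"A": [12, 4, 5, None, 1],
                   "B": [7, 2, 54, 3, None]})

# Default (skipna=True): NaN entries are ignored.
print(df.max())

# skipna=False: any NaN in a column makes that column's result NaN.
print(df.max(skipna=False))

# count() shows how many non-null values each column holds.
print(df.count())
```

A row or column that contains only NaN, such as row 3 in the cells above, yields NaN with either setting.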

In [79]:
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack','Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
#Create a DataFrame
df = pd.DataFrame(d)
print(df.std(numeric_only=True))   #skip the non-numeric Name column

Age       9.232682
Rating    0.661628
dtype: float64


# **Summarize**



In [80]:
print(df.describe())

             Age     Rating
count  12.000000  12.000000
mean   31.833333   3.743333
std     9.232682   0.661628
min    23.000000   2.560000
25%    25.000000   3.230000
50%    29.500000   3.790000
75%    35.500000   4.132500
max    51.000000   4.800000


In [83]:
#Create a Dictionary of series
d = {'Name':pd.Series(['Tina','Ria','Amit','Sumit','Manish','Mayank','Vishal','Vardan','Rahul','Vikesh','Priyank','Bhavesh']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}

#Create a DataFrame
df = pd.DataFrame(d)
print(df)

       Name  Age  Rating
0      Tina   25    4.23
1       Ria   26    3.24
2      Amit   25    3.98
3     Sumit   23    2.56
4    Manish   30    3.20
5    Mayank   29    4.60
6    Vishal   23    3.80
7    Vardan   34    3.78
8     Rahul   40    2.98
9    Vikesh   30    4.80
10  Priyank   51    4.10
11  Bhavesh   46    3.65


In [82]:
print(df.describe(include=['object']))

        Name
count     12
unique    12
top     Tina
freq       1


In [85]:
print(df.describe(include='all'))

        Name        Age     Rating
count     12  12.000000  12.000000
unique    12        NaN        NaN
top     Tina        NaN        NaN
freq       1        NaN        NaN
mean     NaN  31.833333   3.743333
std      NaN   9.232682   0.661628
min      NaN  23.000000   2.560000
25%      NaN  25.000000   3.230000
50%      NaN  29.500000   3.790000
75%      NaN  35.500000   4.132500
max      NaN  51.000000   4.800000
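
When only a few statistics are needed, `describe()` can be narrowed down or replaced by `agg()`. The sketch below uses the Age column from above. The chosen percentiles (10% and 90%) and the names `pcts` and `summary` are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46]})

# Ask describe() for specific percentiles instead of the default quartiles.
pcts = df['Age'].describe(percentiles=[0.1, 0.9])
print(pcts)

# Compute only a chosen set of statistics.
summary = df['Age'].agg(['min', 'median', 'max'])
print(summary)
```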
