<h1><center><font size=6>Python for Data Science: Pandas</font></center></h1>

In [147]:
# importing the libraries
import numpy as np
import pandas as pd

## 1. NumPy

**NumPy Array**
* An array is a data structure that stores values of the same data type.
* While Python lists can hold values of different data types, a NumPy array holds values of a single data type.
* Python lists do not deliver the performance needed for computations on large sets of numerical data. NumPy arrays address this.
* We can create a NumPy array by converting a list to an array.
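
To make the "single data type" point concrete, here is a minimal sketch of what NumPy does when a list mixes types. The variable names are illustrative and not part of the notebook; the exact integer width depends on the platform.

```python
import numpy as np

# A list of integers becomes an integer array (width is platform-dependent).
int_array = np.array([100, 350, 600])

# Mixing ints and floats upcasts every element to float.
mixed_numeric = np.array([100, 350.5, 600])

# Mixing numbers and strings converts every element to a string.
mixed_text = np.array([100, "Apple"])

print(int_array.dtype, mixed_numeric.dtype, mixed_text.dtype)
print(mixed_text)
```

In each case NumPy picks one data type that can represent all the values, which is why the number 100 ends up stored as the string '100' in the last array.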


In [148]:
# defining a list of prices and an array of product names
prices = [100, 350, 600, 200, 900]

sample_products = np.array(["Apple", "Headphones", "Banana", "Shirt", "Book"])
sample_products


array(['Apple', 'Headphones', 'Banana', 'Shirt', 'Book'], dtype='<U10')

In [149]:
type(sample_products)

numpy.ndarray

**NumPy Matrix**

* A matrix is a two-dimensional data structure where elements are arranged into rows and columns.
* A matrix can be created from a list of lists.

In [150]:
# let's say we have information of different sizes of shirts in a store and we want to display them in a matrix format
matrix = np.array([[30,31,32],[33,34,35]])
print(matrix)


[[30 31 32]
 [33 34 35]]


### NumPy Functions

**Using np.arange() function**
* The np.arange() function returns an array of evenly spaced values within a given interval. The interval is half-open, i.e. start is included but stop is excluded.
* It has the following parameters:
  * start : start of interval range. By default start = 0
  * stop  : end of interval range
  * step  : step size of interval. By default step size = 1

In [151]:
arr2  = np.arange(start = 0, stop = 10) # 10 will be excluded from the output
print(arr2)

# or

arr2  = np.arange(0,10)
print(arr2)

[0 1 2 3 4 5 6 7 8 9]
[0 1 2 3 4 5 6 7 8 9]


In [152]:
# adding a step size of 5 to create an array
arr3  = np.arange(start = 0, stop = 20, step = 5)
arr3

array([ 0,  5, 10, 15])

**Using np.linspace() function**
* The np.linspace() function returns evenly spaced numbers over an interval. Here both start and stop are included.
* It has the following parameters:
  * start: start of interval range (required, it has no default)
  * stop: end of interval range
  * num: number of samples to generate. By default num = 50

In [153]:
matrix2 = np.linspace(0,5) # by default 50 evenly spaced values will be generated between 0 and 5
matrix2

array([0.        , 0.10204082, 0.20408163, 0.30612245, 0.40816327,
       0.51020408, 0.6122449 , 0.71428571, 0.81632653, 0.91836735,
       1.02040816, 1.12244898, 1.2244898 , 1.32653061, 1.42857143,
       1.53061224, 1.63265306, 1.73469388, 1.83673469, 1.93877551,
       2.04081633, 2.14285714, 2.24489796, 2.34693878, 2.44897959,
       2.55102041, 2.65306122, 2.75510204, 2.85714286, 2.95918367,
       3.06122449, 3.16326531, 3.26530612, 3.36734694, 3.46938776,
       3.57142857, 3.67346939, 3.7755102 , 3.87755102, 3.97959184,
       4.08163265, 4.18367347, 4.28571429, 4.3877551 , 4.48979592,
       4.59183673, 4.69387755, 4.79591837, 4.89795918, 5.        ])

In [154]:
# generating 10 evenly spaced values between 10 and 20
matrix3 = np.linspace(10,20,10)
matrix3

array([10.        , 11.11111111, 12.22222222, 13.33333333, 14.44444444,
       15.55555556, 16.66666667, 17.77777778, 18.88888889, 20.        ])

**How are these values getting generated?**

The step size, i.e. the difference between consecutive elements, is given by the following formula:

**(stop - start) / (total elements - 1)**

So, for np.linspace(0, 5) with the default of 50 elements:
(5 - 0) / 49 = 0.10204082

The first value is the start, 0. The second value is 0 + 0.10204082 = 0.10204082, the third value is 0.10204082 + 0.10204082 = 0.20408163, the fourth value is 0.30612245, and so on up to the stop, 5.
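
The formula can be checked directly. This is a small sketch, assuming the same interval (0 to 5) and 50 elements; `retstep=True` asks np.linspace to also return the spacing it used.

```python
import numpy as np

# retstep=True makes np.linspace return the array and the step size.
values, step = np.linspace(0, 5, num=50, retstep=True)

# The spacing should match (stop - start) / (total elements - 1).
expected_step = (5 - 0) / (50 - 1)
print(step, expected_step)

# Consecutive differences are all (approximately) equal to that step.
print(np.allclose(np.diff(values), expected_step))
```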

In [155]:
# generating 50 evenly spaced values between 0 and 5
matrix3 = np.linspace(0,5,50)
matrix3

array([0.        , 0.10204082, 0.20408163, 0.30612245, 0.40816327,
       0.51020408, 0.6122449 , 0.71428571, 0.81632653, 0.91836735,
       1.02040816, 1.12244898, 1.2244898 , 1.32653061, 1.42857143,
       1.53061224, 1.63265306, 1.73469388, 1.83673469, 1.93877551,
       2.04081633, 2.14285714, 2.24489796, 2.34693878, 2.44897959,
       2.55102041, 2.65306122, 2.75510204, 2.85714286, 2.95918367,
       3.06122449, 3.16326531, 3.26530612, 3.36734694, 3.46938776,
       3.57142857, 3.67346939, 3.7755102 , 3.87755102, 3.97959184,
       4.08163265, 4.18367347, 4.28571429, 4.3877551 , 4.48979592,
       4.59183673, 4.69387755, 4.79591837, 4.89795918, 5.        ])

**Using np.zeros()**

* np.zeros() is a function for creating arrays (including matrices) in NumPy.
* It returns a matrix of the given shape, filled with zeros.
* It has the following parameters:    
  * shape : Number of rows and columns in the output matrix.
  * dtype: data type of the elements in the matrix, by default the value is set to `float`.



**Using np.ones()**

* np.ones() is another function for creating arrays (including matrices) in NumPy.
* It returns a matrix of the given shape and type, filled with ones.
* It has the following parameters:  
  * shape : Number of rows and columns in the output matrix.
  * dtype: data type of the elements in the matrix, by default the value is set to `float`.



**Using np.eye()**
* np.eye() is a function for creating an identity-style matrix in NumPy.
* It returns a matrix with ones on the main diagonal and zeros elsewhere.
* It has the following parameters:
  * n: Number of rows and columns in the output matrix
  * dtype: data type of the elements in the matrix, by default the value is set to `float`.

In [156]:
# np.zeros
matrix4 = np.zeros([3,5])
print(matrix4,'\n')

#np.ones
matrix5 = np.ones([3,5])
print(matrix5,'\n')

#np.eye
matrix6 = np.eye(5)
print(matrix6)

[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]] 

[[1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]] 

[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]
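
All three functions return floats by default, which is why the outputs above show `0.` and `1.`. The sketch below passes the dtype parameter to get integers instead; `dtype=int` and the shapes are just illustrative choices.

```python
import numpy as np

# dtype=int is an illustrative choice; any NumPy dtype can be passed.
int_zeros = np.zeros((2, 3), dtype=int)
int_ones = np.ones((2, 3), dtype=int)
int_identity = np.eye(3, dtype=int)

print(int_zeros)
print(int_ones)
print(int_identity)

# Compare with the default dtype.
print(int_zeros.dtype, np.zeros((2, 3)).dtype)
```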


**np.reshape**

* The shape of an array gives its number of dimensions and the number of elements along each dimension. Reshaping a NumPy array means changing its shape.
* By reshaping an array we can add or remove dimensions or change the number of elements in each dimension, as long as the total number of elements stays the same.
* To reshape a NumPy array, we call the reshape method on it.
* **Syntax:** array.reshape(shape)
  * shape: a tuple given as input; its values will be the new shape of the array.

In [157]:
# defining an array with values 0 to 9
arr4 = np.arange(0,10)
print('original array :',arr4,'\n')

# reshaping the array arr4 to a 2 x 5 matrix
arr4_reshaped = arr4.reshape((2,5))
print('Reshaped Array to 2x5 :\n',arr4_reshaped,'\n')

try:
    # reshaping the array arr4 to a 2 x 6 matrix; this fails, so a try/except block handles the error
    arr4_reshaped2 = arr4.reshape((2,6))
    print(arr4_reshaped2)
except ValueError as e:
    print('Error Occurred :',e)


original array : [0 1 2 3 4 5 6 7 8 9] 

Reshaped Array to 2x5 :
 [[0 1 2 3 4]
 [5 6 7 8 9]] 

Error Occurred : cannot reshape array of size 10 into shape (2,6)


* This did not work because we have 10 elements in the array that we are trying to fit in a 2 X 6 shape, which would require 12 elements.
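
When only one dimension is known, reshape can work out the other one. A minimal sketch, assuming the same 10-element array: passing -1 for a dimension asks NumPy to infer it from the total number of elements.

```python
import numpy as np

arr = np.arange(0, 10)

# -1 asks NumPy to infer that dimension from the total number of elements.
two_rows = arr.reshape((2, -1))    # inferred shape: (2, 5)
five_rows = arr.reshape((-1, 2))   # inferred shape: (5, 2)

print(two_rows.shape, five_rows.shape)
```

The inference only works when the known dimension divides the number of elements evenly; otherwise the same error as above is raised.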

**NumPy can also perform a large number of different mathematical operations and it provides different functions to do so.**

NumPy provides:
1. Trigonometric functions
2. Exponents and Logarithmic functions
3. Functions for arithmetic operations between arrays and matrices

**Trigonometric functions**

In [158]:
# the argument is interpreted in radians
print('Sine Function:',np.sin(4))
print('Cosine Function:',np.cos(4))
print('Tan Function:',np.tan(4))

Sine Function: -0.7568024953079282
Cosine Function: -0.6536436208636119
Tan Function: 1.1578212823495775


**Exponents and Logarithmic functions**

* Exponents

In [159]:
np.exp(2)

7.38905609893065

In [160]:
arr5 = np.array([2,4,6])
np.exp(arr5)

array([  7.3890561 ,  54.59815003, 403.42879349])

* Logarithms

In [161]:
# np.log computes the natural logarithm (base e)
np.log(2)

0.6931471805599453

In [162]:
np.log(arr5)

array([0.69314718, 1.38629436, 1.79175947])

In [163]:
## log with base 10
np.log10(8)

0.9030899869919435

**Arithmetic Operations on arrays**

In [164]:
# arithmetic on lists

arr5 = [1,2,3]
arr6 = [4,5,6]
print(arr5+arr6)
# + concatenates Python lists instead of adding them element-wise!

[1, 2, 3, 4, 5, 6]


In [165]:
# NumPy arrays support element-wise +, -, * and /

# defining two arrays
arr7 = np.arange(1,6)
print('arr7:', arr7)

arr8 = np.arange(3,8)
print('arr8:', arr8)

arr7: [1 2 3 4 5]
arr8: [3 4 5 6 7]


In [166]:
print('Addition: ',arr7+arr8)
print('Subtraction: ',arr8-arr7)
print('Multiplication:' , arr7*arr8)
print('Division:', arr7/arr8)
print('Inverse:', 1/arr7)
print('Powers:', arr7**arr8) # in Python, powers use **, NOT ^ (^ is the bitwise XOR operator)

Addition:  [ 4  6  8 10 12]
Subtraction:  [2 2 2 2 2]
Multiplication: [ 3  8 15 24 35]
Division: [0.33333333 0.5        0.6        0.66666667 0.71428571]
Inverse: [1.         0.5        0.33333333 0.25       0.2       ]
Powers: [    1    16   243  4096 78125]
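
To see how different `**` and `^` are, here is a short sketch on a small illustrative array (not one of the arrays defined above).

```python
import numpy as np

base = np.array([1, 2, 3])

squared = base ** 2   # element-wise power
xored = base ^ 2      # element-wise bitwise XOR with 2 (binary 10)

print(squared)
print(xored)
```

Both lines run without any error, which is what makes the mix-up easy to miss.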


**Operations on Matrices**

In [167]:
matrix7 = np.arange(1,10).reshape(3,3)
print(matrix7,'\n')

matrix8 = np.eye(3)
print(matrix8)

[[1 2 3]
 [4 5 6]
 [7 8 9]] 

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]


In [168]:
print('Addition: \n', matrix7+matrix8)
print('Subtraction: \n ', matrix7-matrix8)
print('Multiplication: \n', matrix7*matrix8)
print('Division: \n', matrix7/matrix8)

Addition: 
 [[ 2.  2.  3.]
 [ 4.  6.  6.]
 [ 7.  8. 10.]]
Subtraction: 
  [[0. 2. 3.]
 [4. 4. 6.]
 [7. 8. 8.]]
Multiplication: 
 [[1. 0. 0.]
 [0. 5. 0.]
 [0. 0. 9.]]
Division: 
 [[ 1. inf inf]
 [inf  5. inf]
 [inf inf  9.]]


* Along with this output, NumPy emits a RuntimeWarning about division by zero. A warning does not stop the program: the cell still runs to completion and returns a result.
* Division by zero is one of the most common run-time problems in numerical code.
* The off-diagonal entries of matrix8 are 0, so each of those element-wise divisions is a nonzero number divided by zero, which NumPy reports as inf (infinity).
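
If the inf values are expected, the warning can be silenced for just that computation. This is a sketch with illustrative arrays, using NumPy's `np.errstate` context manager; it also shows that 0 / 0 gives nan rather than inf.

```python
import numpy as np

numerators = np.array([1.0, 0.0, -2.0])
denominators = np.zeros(3)

# np.errstate temporarily silences the divide-by-zero and invalid-value warnings.
with np.errstate(divide='ignore', invalid='ignore'):
    result = numerators / denominators

print(result)            # nonzero / 0 gives +/-inf, 0 / 0 gives nan
print(np.isinf(result))
```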

**Linear algebra matrix multiplication**

In [169]:
matrix9 = np.arange(1,10).reshape(3,3)
print('First Matrix: \n',matrix9)

matrix10 = np.arange(11,20).reshape(3,3)
print('Second Matrix: \n',matrix10)
print('')
# taking linear algebra matrix multiplication (some may have heard this called the dot product)
print('Multiplication: \n', matrix9 @ matrix10)

First Matrix: 
 [[1 2 3]
 [4 5 6]
 [7 8 9]]
Second Matrix: 
 [[11 12 13]
 [14 15 16]
 [17 18 19]]

Multiplication: 
 [[ 90  96 102]
 [216 231 246]
 [342 366 390]]
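
It is easy to confuse `*` with `@`. The sketch below uses two small illustrative matrices to contrast element-wise multiplication with linear algebra matrix multiplication.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

elementwise = a * b       # multiplies matching entries
matrix_product = a @ b    # rows of a times columns of b

print('Element-wise: \n', elementwise)
print('Matrix product: \n', matrix_product)

# For 2-D arrays, np.dot gives the same result as @.
print(np.array_equal(matrix_product, np.dot(a, b)))
```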


**Transpose of a matrix**

In [170]:
print(matrix9)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


In [171]:
# taking transpose of matrix
np.transpose(matrix9)

array([[1, 4, 7],
       [2, 5, 8],
       [3, 6, 9]])

In [172]:
# another way of taking a transpose
matrix9.T

array([[1, 4, 7],
       [2, 5, 8],
       [3, 6, 9]])

**Function to find minimum and maximum values**

In [173]:
print(matrix9)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


In [174]:
print('Minimum value: ',np.min(matrix9))

Minimum value:  1


In [175]:
print('Maximum value: ',np.max(matrix9))

Maximum value:  9
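
np.min and np.max can also work along one axis instead of over the whole matrix. A minimal sketch, assuming the same 3 x 3 matrix of values 1 to 9; the variable names are illustrative.

```python
import numpy as np

matrix = np.arange(1, 10).reshape(3, 3)

column_minimums = np.min(matrix, axis=0)   # one value per column
row_maximums = np.max(matrix, axis=1)      # one value per row
largest_position = np.argmax(matrix)       # position in the flattened matrix

print('Minimum of each column:', column_minimums)
print('Maximum of each row:', row_maximums)
print('Flattened position of the maximum:', largest_position)
```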


**Function to generate random samples**

**Using np.random.rand function**

* np.random.rand returns a NumPy array whose elements are drawn randomly from the uniform distribution over [0, 1), i.e. including 0 but excluding 1.
* **Syntax** - np.random.rand(d0,d1)
  * d0,d1 – the dimensions of the required array, given as ints; d1 is optional.

In [176]:
# Generating random values in an array
rand_mat = np.random.rand(5)
print(rand_mat)

[0.68315441 0.52889406 0.63983313 0.70869553 0.44343053]


In [177]:
# Generating random values in a matrix
rand_mat = np.random.rand(5,5) # uniform random variable
print(rand_mat)

[[0.61008036 0.0102353  0.57690069 0.87381269 0.20362265]
 [0.00379387 0.62440205 0.27833235 0.90750125 0.97499556]
 [0.47884654 0.2027557  0.42912074 0.47071001 0.50251947]
 [0.50433684 0.39451315 0.68055893 0.35260319 0.31994877]
 [0.22213589 0.82959287 0.79630882 0.41557711 0.06313038]]


**Using np.random.randn function**

* The np.random.randn returns a random numpy array whose sample(s) are drawn randomly from the standard normal distribution (Mean as 0 and standard deviation as 1)

* **Syntax** - np.random.randn(d0,d1)
  * d0,d1 – It represents the dimension of the output, where d1 is optional.

In [178]:
# Generating random values in an array
rand_mat2 = np.random.randn(5)
print(rand_mat2)

[-1.44057745 -0.37254117 -0.67906123 -0.22172765 -0.06725497]


In [179]:
# Generating random values in a matrix
rand_mat2 = np.random.randn(5,5)
print(rand_mat2)

[[ 0.98558561 -0.46180973  0.89890246 -0.49150868 -0.36663698]
 [-0.78528598  0.12038687 -1.45295166 -1.31153858 -1.32324993]
 [-0.52754826  1.43394193 -1.57171895 -0.86124677 -0.57701783]
 [-0.28922087 -0.5638681  -1.33495761 -0.37793519  0.46876585]
 [ 0.98440679  1.36977918  1.11368844 -0.57482905 -0.30218044]]


In [180]:
# Let's check the mean and standard deviation of rand_mat2
print('Mean:',np.mean(rand_mat2))
print('Standard Deviation:',np.std(rand_mat2))

Mean: -0.2319219000885031
Standard Deviation: 0.89823189725636


*  With only 25 samples, the mean (about -0.23) and the standard deviation (about 0.90) are only roughly near 0 and 1. The match gets closer as the number of samples grows.
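
The effect of sample size can be seen with a quick experiment. This sketch uses `np.random.default_rng`, NumPy's newer random generator interface, instead of `np.random.randn`; the seed 0 and the sample sizes are arbitrary choices made only so the run is reproducible.

```python
import numpy as np

# default_rng is NumPy's newer Generator API; the seed 0 is arbitrary.
rng = np.random.default_rng(0)

small_sample = rng.standard_normal(25)
large_sample = rng.standard_normal(100_000)

print('25 samples      -> mean:', small_sample.mean(), 'std:', small_sample.std())
print('100,000 samples -> mean:', large_sample.mean(), 'std:', large_sample.std())
```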

**Using np.random.randint function**

* The np.random.randint returns a random numpy array whose element(s) are drawn randomly from low (inclusive) to the high (exclusive) range.

* **Syntax** - np.random.randint(low, high, size)

  * low – It represents the lowest inclusive bound of the distribution from where the sample can be drawn.
  * high – It represents the upper exclusive bound of the distribution from where the sample can be drawn.
  * size – It represents the shape of the output.

In [181]:
# Generating random values in an array
rand_mat3 = np.random.randint(1,5,10)
print(rand_mat3)

[2 3 4 4 3 2 1 3 2 2]


In [182]:
# Generating random values in a matrix
rand_mat3 = np.random.randint(1,10,[5,5])
print(rand_mat3)

[[4 1 7 8 3]
 [5 6 4 1 2]
 [2 3 3 9 8]
 [9 5 4 5 2]
 [9 2 6 5 2]]


### Accessing the entries of a NumPy Array

In [183]:
# let's generate an array with 10 random values
rand_arr = np.random.randn(10)
print(rand_arr)

[ 1.18143326  0.8091198  -2.04324545 -0.25914784  0.76678129 -0.78625439
 -0.37961151  0.28812754 -0.85373917  0.19290153]


* Accessing one element from an array

In [184]:
# accessing the entry at index 6 of rand_arr (the 7th entry, since indexing starts at 0)
print(rand_arr[6])

-0.3796115145036556


* Accessing multiple elements from an array

In [185]:
# we can access multiple entries at once using a slice (index 4 up to, but not including, index 9)
print(rand_arr[4:9])

[ 0.76678129 -0.78625439 -0.37961151  0.28812754 -0.85373917]


In [186]:
# we can also access multiple non-consecutive entries using np.arange
print('Index of values to access: ',np.arange(3,10,3))
print(rand_arr[np.arange(3,10,3)])

Index of values to access:  [3 6 9]
[-0.25914784 -0.37961151  0.19290153]


**Accessing arrays using logical operations**

In [187]:
print(rand_arr)

[ 1.18143326  0.8091198  -2.04324545 -0.25914784  0.76678129 -0.78625439
 -0.37961151  0.28812754 -0.85373917  0.19290153]


In [188]:
rand_arr>0

array([ True,  True, False, False,  True, False, False,  True, False,
        True])

In [189]:
# accessing all the values of rand_arr which are greater than 0
print('Values greater than 0: ',rand_arr[rand_arr>0])

# accessing all the values of rand_arr which are less than 0
print('Values less than 0: ',rand_arr[rand_arr<0])

Values greater than 0:  [1.18143326 0.8091198  0.76678129 0.28812754 0.19290153]
Values less than 0:  [-2.04324545 -0.25914784 -0.78625439 -0.37961151 -0.85373917]
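
Conditions can also be combined. A small sketch with an illustrative array: each condition goes in its own parentheses, with `&` for "and" and `|` for "or".

```python
import numpy as np

values = np.array([-2.0, -0.5, 0.0, 0.5, 1.5, 3.0])

# Each condition needs its own parentheses; & means "and", | means "or".
between = values[(values > -1) & (values < 1)]
outside = values[(values < -1) | (values > 1)]

print('Values between -1 and 1: ', between)
print('Values outside -1 to 1: ', outside)
```

Python's `and` and `or` keywords do not work element-wise on arrays, so the operators `&` and `|` are used here instead.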


**Accessing the entries of a Matrix**

In [190]:
# let's generate a 5 x 5 matrix of random values
rand_mat = np.random.randn(5,5)
print(rand_mat)

[[-0.87896698 -0.48813065  0.4382081  -0.43627971 -0.92807524]
 [ 1.38115664  0.04015422 -0.39104822 -0.78584337  0.61534959]
 [-2.19621309 -0.95273923  0.80966635 -2.84564654 -0.85839525]
 [-0.56089931 -0.40337039 -0.78963207  1.74278688 -0.67374498]
 [ 1.13716951 -0.72771408  0.92996247  0.34360341  1.0239192 ]]


In [191]:
# accessing the second row of rand_mat
rand_mat[1]

array([ 1.38115664,  0.04015422, -0.39104822, -0.78584337,  0.61534959])

In [192]:
# accessing the third element of the second row
print(rand_mat[1][2])

#or

print(rand_mat[1,2])

-0.39104822450674503
-0.39104822450674503


In [193]:
# accessing first two rows with second and third column
print(rand_mat[0:2,1:3])

[[-0.48813065  0.4382081 ]
 [ 0.04015422 -0.39104822]]


**Accessing matrices using logical operations**

In [194]:
print(rand_mat)

[[-0.87896698 -0.48813065  0.4382081  -0.43627971 -0.92807524]
 [ 1.38115664  0.04015422 -0.39104822 -0.78584337  0.61534959]
 [-2.19621309 -0.95273923  0.80966635 -2.84564654 -0.85839525]
 [-0.56089931 -0.40337039 -0.78963207  1.74278688 -0.67374498]
 [ 1.13716951 -0.72771408  0.92996247  0.34360341  1.0239192 ]]


In [195]:
# accessing all the values of rand_mat which are greater than 0
print('Values greater than 0: \n ',rand_mat[rand_mat>0])

# accessing all the values of rand_mat which are less than 0
print('Values less than 0: \n',rand_mat[rand_mat<0])

Values greater than 0: 
  [0.4382081  1.38115664 0.04015422 0.61534959 0.80966635 1.74278688
 1.13716951 0.92996247 0.34360341 1.0239192 ]
Values less than 0: 
 [-0.87896698 -0.48813065 -0.43627971 -0.92807524 -0.39104822 -0.78584337
 -2.19621309 -0.95273923 -2.84564654 -0.85839525 -0.56089931 -0.40337039
 -0.78963207 -0.67374498 -0.72771408]


**Modifying the entries of an Array**

In [196]:
print(rand_arr)


[ 1.18143326  0.8091198  -2.04324545 -0.25914784  0.76678129 -0.78625439
 -0.37961151  0.28812754 -0.85373917  0.19290153]


In [197]:
# let's change some values in an array!
# changing the values of index value 3 and index value 4 to 5
rand_arr[3:5] = 5
print(rand_arr)

[ 1.18143326  0.8091198  -2.04324545  5.          5.         -0.78625439
 -0.37961151  0.28812754 -0.85373917  0.19290153]


In [198]:
# changing the values of index value 0 and index value 1 to 2 and 3 respectively
rand_arr[0:2] = [2,3]
print(rand_arr)

[ 2.          3.         -2.04324545  5.          5.         -0.78625439
 -0.37961151  0.28812754 -0.85373917  0.19290153]


In [199]:
# modifying the entries selected by a logical condition
rand_arr[rand_arr>0] = 65
rand_arr

array([65.        , 65.        , -2.04324545, 65.        , 65.        ,
       -0.78625439, -0.37961151, 65.        , -0.85373917, 65.        ])

**Modifying the entries of a Matrix**

In [200]:
print(rand_mat3)

[[4 1 7 8 3]
 [5 6 4 1 2]
 [2 3 3 9 8]
 [9 5 4 5 2]
 [9 2 6 5 2]]


In [201]:
# changing the values of the 4th and 5th element of the second and third rows of the matrix to 0
print('Matrix before modification: \n',rand_mat3)
rand_mat3[1:3,3:5] = 0
print('Matrix after modification: \n',rand_mat3)

Matrix before modification: 
 [[4 1 7 8 3]
 [5 6 4 1 2]
 [2 3 3 9 8]
 [9 5 4 5 2]
 [9 2 6 5 2]]
Matrix after modification: 
 [[4 1 7 8 3]
 [5 6 4 0 0]
 [2 3 3 0 0]
 [9 5 4 5 2]
 [9 2 6 5 2]]


In [202]:
# extracting the first 2 rows and first 3 columns from the matrix
sub_mat = rand_mat[0:2,0:3]
print(sub_mat)

[[-0.87896698 -0.48813065  0.4382081 ]
 [ 1.38115664  0.04015422 -0.39104822]]


In [203]:
# changing all the values of the extracted matrix to 3
sub_mat[:] = 3
print(sub_mat)

[[3. 3. 3.]
 [3. 3. 3.]]


In [204]:
# what happened to rand_mat when we change sub_mat?
rand_mat

array([[ 3.        ,  3.        ,  3.        , -0.43627971, -0.92807524],
       [ 3.        ,  3.        ,  3.        , -0.78584337,  0.61534959],
       [-2.19621309, -0.95273923,  0.80966635, -2.84564654, -0.85839525],
       [-0.56089931, -0.40337039, -0.78963207,  1.74278688, -0.67374498],
       [ 1.13716951, -0.72771408,  0.92996247,  0.34360341,  1.0239192 ]])

In [205]:
# rand_mat changed too, because a slice is a view of the original data, not a copy
# to prevent this behavior we need to use the .copy() method when we assign sub_mat
# this behavior is a common source of errors for new NumPy users

rand_mat = np.random.randn(5,5)
print(rand_mat)
sub_mat = rand_mat[0:2,0:3].copy()
sub_mat[:] = 3
print(sub_mat)
print(rand_mat)

[[ 0.79857218 -0.41907161 -0.40220847  2.42532756 -0.30890133]
 [-0.1595512  -0.28692254 -0.39485919  1.90142022  0.62263269]
 [ 0.7411203   1.04208282  0.46822234 -0.58934298  0.28257484]
 [-1.02031043 -2.24531001 -0.81272053 -1.56518607  0.56330285]
 [-0.49268167  0.89693903  1.23744575 -0.55899481 -0.97293666]]
[[3. 3. 3.]
 [3. 3. 3.]]
[[ 0.79857218 -0.41907161 -0.40220847  2.42532756 -0.30890133]
 [-0.1595512  -0.28692254 -0.39485919  1.90142022  0.62263269]
 [ 0.7411203   1.04208282  0.46822234 -0.58934298  0.28257484]
 [-1.02031043 -2.24531001 -0.81272053 -1.56518607  0.56330285]
 [-0.49268167  0.89693903  1.23744575 -0.55899481 -0.97293666]]
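
One way to check whether two arrays share data is `np.shares_memory`. The sketch below uses a small illustrative matrix rather than rand_mat.

```python
import numpy as np

matrix = np.arange(12).reshape(3, 4)

view = matrix[0:2, 0:3]                  # basic slicing returns a view
independent = matrix[0:2, 0:3].copy()    # .copy() allocates new memory

print(np.shares_memory(matrix, view))
print(np.shares_memory(matrix, independent))

view[:] = -1    # also overwrites the matching part of matrix
print(matrix)
print(independent)
```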


**Given a list of product prices, convert it into a NumPy array.**

In [206]:
prices = [100, 350, 600, 200, 900]
prices_array = np.array(prices)
print("Numpy Array of Prices:", prices_array)


Numpy Array of Prices: [100 350 600 200 900]


**Given a numpy array of product prices and another array of the corresponding quantities sold, calculate the total and average sales.**

In [207]:
prices = np.array([100, 350, 600, 200, 900])
quantities = np.array([30, 50, 20, 60, 110])

total_sales = np.sum(prices * quantities)
average_sales = np.mean(prices * quantities)
print("Total Sales:", total_sales)
print("Average Sales:", average_sales)


Total Sales: 143500
Average Sales: 28700.0


**Given a numpy array of product names, another array of prices, and a third array of quantities sold, identify the top 2 products generating the highest revenue.**

In [208]:
products = np.array(["Apple", "Headphones", "Banana", "Shirt", "Book"])
prices = np.array([100, 1350, 80, 200, 200])
quantities = np.array([100, 50, 200, 40, 70])

revenues = prices * quantities
top_indices = np.argsort(revenues)[-2:]  # indices of the top 2 products, in ascending order of revenue
top_products = products[top_indices]
print("Best-Selling Products:", top_products)


Best-Selling Products: ['Banana' 'Headphones']
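
np.argsort sorts in ascending order, so the result above lists the second-highest product first. If the highest revenue should come first, the sorted indices can be reversed. A sketch using the same data:

```python
import numpy as np

products = np.array(["Apple", "Headphones", "Banana", "Shirt", "Book"])
prices = np.array([100, 1350, 80, 200, 200])
quantities = np.array([100, 50, 200, 40, 70])
revenues = prices * quantities

# argsort sorts ascending, so reverse it to put the largest revenue first.
descending_indices = np.argsort(revenues)[::-1]
top_two = products[descending_indices[:2]]

print("Top 2 products by revenue:", top_two)
```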


**Given a numpy array of product prices and another array of the same length representing the quantity sold for each product, calculate the weighted average price of the products. The weighted average should account for the quantity sold of each product.**

In [209]:
prices = np.array([100, 350, 600, 200, 900])
quantities_sold = np.array([5, 3, 2, 4, 1])

weighted_average_price = np.sum(prices * quantities_sold) / np.sum(quantities_sold)
print("Weighted Average Price:", weighted_average_price)



Weighted Average Price: 296.6666666666667
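
NumPy also has a built-in for this calculation. The sketch below uses `np.average` with its `weights` parameter on the same data and should agree with the manual formula.

```python
import numpy as np

prices = np.array([100, 350, 600, 200, 900])
quantities_sold = np.array([5, 3, 2, 4, 1])

# np.average computes sum(prices * weights) / sum(weights).
weighted_average_price = np.average(prices, weights=quantities_sold)
manual_result = np.sum(prices * quantities_sold) / np.sum(quantities_sold)

print("Weighted Average Price:", weighted_average_price)
print("Matches manual formula:", np.isclose(weighted_average_price, manual_result))
```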


**Q15. Calculate Total Revenue for Each Product**

In [210]:
# Example data
product_names = np.array(["Apple", "Bread", "Milk", "Eggs"])
prices = np.array([100, 60, 50, 70])
quantities_sold = np.array([100, 80, 90, 200])

# Calculate total revenue for each product
total_revenues = prices * quantities_sold

# Display the total revenue for each product
for name, revenue in zip(product_names, total_revenues):
    print(f"Total Revenue for {name}: {revenue} Rs")

Total Revenue for Apple: 10000 Rs
Total Revenue for Bread: 4800 Rs
Total Revenue for Milk: 4500 Rs
Total Revenue for Eggs: 14000 Rs


**How can you increase the price of all products priced below 200 by 10% using NumPy?**

In [211]:
# Assuming prices is a NumPy array
prices = np.array([150.0, 200.0, 100.0, 350.0])  # Example prices

# Increase prices by 10% where the condition is met
prices[prices < 200.00] *= 1.10
print("Updated Prices:", prices)


Updated Prices: [165. 200. 110. 350.]
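
The update above changes prices in place. If the original prices should be kept, `np.where` can build a new array instead. A minimal sketch with the same example prices; `updated_prices` is an illustrative name.

```python
import numpy as np

prices = np.array([150.0, 200.0, 100.0, 350.0])

# np.where(condition, value_if_true, value_if_false) returns a new array.
updated_prices = np.where(prices < 200.00, prices * 1.10, prices)

print("Original Prices:", prices)
print("Updated Prices:", updated_prices)
```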


## 2. Pandas - Series and DataFrames

#### **Pandas Series**
* A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.).
* The labels are collectively called the index.
* A Pandas Series can be thought of as a single column of an Excel spreadsheet, where each entry in the series corresponds to an individual row in the spreadsheet.
* By default, a Pandas Series gets an integer index starting at 0 (0, 1, 2, 3, 4, ...).

In [212]:
# creating a list of prices of different medicines
med_price_list = [200,150,75,100,125]

# converting the med_price_list to an array
med_price_arr = np.array(med_price_list)

# converting the list and array into a Pandas Series object
series_list = pd.Series(med_price_list)
series_arr = pd.Series(med_price_arr)

# printing the converted series object
print(series_list)
print(series_arr)
print(type(series_list))
print(type(series_arr))

0    200
1    150
2     75
3    100
4    125
dtype: int64
0    200
1    150
2     75
3    100
4    125
dtype: int64
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


* We can see that the list and the array have each been converted to a Pandas Series object.
* We also see that the series has automatically been given index labels. Let's see how these can be modified.

In [213]:
# changing the index of a series
med_price_list_labeled = pd.Series(med_price_list, index = ['Omeprazole','Azithromycin','Metformin','Ibuprofen','Cetirizine'])
print(med_price_list_labeled)


Omeprazole      200
Azithromycin    150
Metformin        75
Ibuprofen       100
Cetirizine      125
dtype: int64


**Q1. Create a pandas series of sales figures for a product over a week (in units)**  

In [214]:
# Sales figures for a product over a week (in units)
sales_data = [120, 135, 98, 105, 150, 160, 175]

# Creating a pandas Series
weekly_sales = pd.Series(sales_data, index=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])

print(weekly_sales)
print(type(weekly_sales))

Monday       120
Tuesday      135
Wednesday     98
Thursday     105
Friday       150
Saturday     160
Sunday       175
dtype: int64
<class 'pandas.core.series.Series'>


**Performing mathematical operations on Pandas Series**

* The price of each medicine was increased by 10 Rs. Let's add this to the existing price.

In [215]:
# adding 10 to existing prices
med_price_list_labeled_updated = med_price_list_labeled + 10
med_price_list_labeled_updated

Omeprazole      210
Azithromycin    160
Metformin        85
Ibuprofen       110
Cetirizine      135
dtype: int64

* A new price list was released by vendors for each medicine. Let's find the difference between new price and the old price

In [216]:
new_price_list = [230, 180, 100, 130, 155]
new_price_list_labeled = pd.Series(new_price_list, index = ['Omeprazole','Azithromycin','Metformin','Ibuprofen','Cetirizine'])
print(new_price_list_labeled)

Omeprazole      230
Azithromycin    180
Metformin       100
Ibuprofen       130
Cetirizine      155
dtype: int64


In [217]:
print('Difference between new price and old price - ')
print(new_price_list_labeled - med_price_list_labeled_updated)

Difference between new price and old price - 
Omeprazole      20
Azithromycin    20
Metformin       15
Ibuprofen       20
Cetirizine      20
dtype: int64
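
The subtraction above works because both series have the same index labels. Pandas matches values by label, not by position. The sketch below uses two small hypothetical series whose labels only partly overlap to show what happens otherwise.

```python
import pandas as pd

# Hypothetical series: only 'Azithromycin' appears in both.
old_prices = pd.Series([210, 160], index=['Omeprazole', 'Azithromycin'])
new_prices = pd.Series([180, 100], index=['Azithromycin', 'Metformin'])

# Subtraction matches values by label; labels missing on either side give NaN.
difference = new_prices - old_prices
print(difference)

# fill_value=0 treats a missing label as 0 instead of producing NaN.
difference_filled = new_prices.sub(old_prices, fill_value=0)
print(difference_filled)
```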


**Q2. The sales figures increased by 15%. Apply this increase to the existing sales figures**

In [218]:
# increasing the existing sales figures by 15%
new_weekly_sales = weekly_sales * 1.15
new_weekly_sales

Monday       138.00
Tuesday      155.25
Wednesday    112.70
Thursday     120.75
Friday       172.50
Saturday     184.00
Sunday       201.25
dtype: float64

#### **Pandas DataFrame**

Pandas DataFrame is a two-dimensional tabular data structure with labeled axes (rows and columns).

**Creating a Pandas DataFrame using a list**

In [219]:
mobiles = ['Apple', 'Samsung', 'Oppo', 'Vivo', 'Motorola']
df1 = pd.DataFrame(mobiles,columns=['Mobiles'])
df1

    Mobiles
0     Apple
1   Samsung
2      Oppo
3      Vivo
4  Motorola


**Creating a Pandas DataFrame using a dictionary**

In [220]:
# defining another list
Rating = ['A++','A+','C', 'B+', 'A-']

# creating the dataframe using a dictionary
df2 = pd.DataFrame({'Mobiles':mobiles,'Rating':Rating})
df2

    Mobiles Rating
0     Apple    A++
1   Samsung     A+
2      Oppo      C
3      Vivo     B+
4  Motorola     A-


**Creating a Pandas DataFrame using Series**

The data for total mobile sales was collected from 2012 to 2018. Let's see how this data can be presented in the form of a DataFrame.

**Note** - The values are in million units.

In [221]:
year = pd.Series([2012,2013,2014,2015,2016,2017,2018])
sales = pd.Series([218,251,292,338,405,486,583])

df3 = pd.DataFrame({'Year':year,'Sales(Million units)':sales})
df3

   Year  Sales(Million units)
0  2012                   218
1  2013                   251
2  2014                   292
3  2015                   338
4  2016                   405
5  2017                   486
6  2018                   583


**Creating a Pandas DataFrame using random values**

For encryption purposes a web browser company wants to generate random values which have mean equal to 0 and variance equal to 1. They want 5 randomly generated numbers in 2 different trials.

In [222]:
# we can create a new dataframe using random values
df4 = pd.DataFrame(np.random.randn(5,2),columns = ['Trial 1', 'Trial 2'])
df4

    Trial 1   Trial 2
0  1.217397  0.698037
1  0.368707 -0.486376
2 -1.871641  0.851957
3  0.406917 -0.339497
4  1.364210 -0.431518


**Q3. Let's say we have data for two products - "Product A" and "Product B" - and their sales for each day of the week. Create sales data for both products over a week**

In [223]:
# Sales data for two products over a week
data = [
    [120, 135, 98, 105, 150, 160, 175],  # Product A
    [80, 90, 110, 95, 100, 105, 120]     # Product B
]

# Create DataFrame
df_from_list = pd.DataFrame(data, index=['Product A', 'Product B'], columns=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])

print(df_from_list)

           Monday  Tuesday  Wednesday  Thursday  Friday  Saturday  Sunday
Product A     120      135         98       105     150       160     175
Product B      80       90        110        95     100       105     120


**Q4. Create a sales dataframe using dictionary**





In [224]:
# Sales data for two products over a week
data_dict = {
    'Monday': [120, 80],
    'Tuesday': [135, 90],
    'Wednesday': [98, 110],
    'Thursday': [105, 95],
    'Friday': [150, 100],
    'Saturday': [160, 105],
    'Sunday': [175, 120]
}

# Create DataFrame
df_from_dict = pd.DataFrame(data_dict, index=['Product A', 'Product B'])

print(df_from_dict)

           Monday  Tuesday  Wednesday  Thursday  Friday  Saturday  Sunday
Product A     120      135         98       105     150       160     175
Product B      80       90        110        95     100       105     120


**Q5. Creating sales dataframe using series.**



In [225]:
# Creating Series for each product
series_a = pd.Series([120, 135, 98, 105, 150, 160, 175], index=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
series_b = pd.Series([80, 90, 110, 95, 100, 105, 120], index=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])

# Create DataFrame
df_from_series = pd.DataFrame({'Product A': series_a, 'Product B': series_b})

print(df_from_series)


           Product A  Product B
Monday           120         80
Tuesday          135         90
Wednesday         98        110
Thursday         105         95
Friday           150        100
Saturday         160        105
Sunday           175        120
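
Notice that this DataFrame has the days as rows, while the ones built from the list and the dictionary have the products as rows. The two layouts hold the same table, and `.T` (transpose) switches between them. A sketch, with `by_product` and `by_day` as illustrative names:

```python
import pandas as pd

days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Products as rows, days as columns.
by_product = pd.DataFrame(
    [[120, 135, 98, 105, 150, 160, 175],
     [80, 90, 110, 95, 100, 105, 120]],
    index=['Product A', 'Product B'], columns=days)

# Days as rows, products as columns.
by_day = pd.DataFrame({
    'Product A': pd.Series([120, 135, 98, 105, 150, 160, 175], index=days),
    'Product B': pd.Series([80, 90, 110, 95, 100, 105, 120], index=days)})

# .T swaps rows and columns.
print(by_day.T)
print((by_day.T.values == by_product.values).all())
```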


#### Pandas - Accessing and Modifying

**Accessing Series**

The revenue (in billion Rupees) of different telecommunication operators in India was collected for the year 2020. The following lists contain the names of the operators and their respective revenues.

In [226]:
operators = ['Samsung', 'Vivo', 'Motorola', 'Apple']
revenue = [171.76, 128.29, 68.4, 4.04]

#creating a Series from lists
telecom = pd.Series(revenue, index=operators)
telecom

Samsung     171.76
Vivo        128.29
Motorola     68.40
Apple         4.04
dtype: float64

**Accessing Pandas Series using its index**

In [227]:
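# accessing the first element of the series by position
# (.iloc is used because plain telecom[0] on a labeled series is deprecated in recent pandas versions)
telecom.iloc[0]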

171.76

In [228]:
# accessing the first 3 elements of a series
telecom[:3]

Samsung     171.76
Vivo        128.29
Motorola     68.40
dtype: float64

In [229]:
# accessing the last two elements of a series
telecom[-2:]

Motorola    68.40
Apple        4.04
dtype: float64

In [230]:
# accessing multiple elements of a series by position
telecom.iloc[[0,2,3]]

Samsung     171.76
Motorola     68.40
Apple         4.04
dtype: float64

**Accessing Pandas Series using its labeled index**

In [231]:
# accessing the revenue of Samsung
telecom['Samsung']

171.76

In [232]:
# accessing all revenues up to and including 'Apple' (label-based slices include the end label)
telecom[:'Apple']

Samsung     171.76
Vivo        128.29
Motorola     68.40
Apple         4.04
dtype: float64

In [233]:
# accessing multiple values
telecom[['Samsung','Vivo','Motorola']]

Samsung     171.76
Vivo        128.29
Motorola     68.40
dtype: float64
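
Positional slices and label slices treat their end point differently, which is easy to miss in the cells above. The following is a minimal sketch that rebuilds the telecom Series and compares the two; `by_position` and `by_label` are names introduced here only for illustration.

```python
import pandas as pd

# Same data as the telecom Series above
telecom = pd.Series([171.76, 128.29, 68.4, 4.04],
                    index=['Samsung', 'Vivo', 'Motorola', 'Apple'])

# Positional slice: the stop position (3) is excluded
by_position = telecom.iloc[:3]

# Label slice: the stop label ('Motorola') is included
by_label = telecom.loc[:'Motorola']

print(by_position)
print(by_label)
```

Both slices hold the Samsung, Vivo and Motorola entries, even though one stops at position 3 and the other at the label of position 2.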

**Accessing DataFrames**

The data of the customers visiting 24/7 Stores from different locations was collected. The data includes the customer ID, location of the store, gender of the customer, type of product purchased, quantity of products purchased, and total bill amount. Let's create the dataset and see how to access different entries of it.

In [234]:
# creating the dataframe using dictionary
store_data = pd.DataFrame({'CustomerID': ['CustID00','CustID01','CustID02','CustID03','CustID04']
                           ,'location': ['Mumbai', 'Chennai', 'Bangalore', 'Kolkata', 'Pune']
                           ,'gender': ['M','M','F','M','F']
                           ,'type': ['Electronics','Food','Beverages','Medicine','Beauty']
                           ,'quantity':[1,3,4,2,1],'total_bill':[600,100,150,200,80]})
store_data

Unnamed: 0,CustomerID,location,gender,type,quantity,total_bill
0,CustID00,Mumbai,M,Electronics,1,600
1,CustID01,Chennai,M,Food,3,100
2,CustID02,Bangalore,F,Beverages,4,150
3,CustID03,Kolkata,M,Medicine,2,200
4,CustID04,Pune,F,Beauty,1,80


In [235]:
# accessing first row of the dataframe
store_data[:1]

Unnamed: 0,CustomerID,location,gender,type,quantity,total_bill
0,CustID00,Mumbai,M,Electronics,1,600


In [236]:
# accessing the 'location' column of the dataframe
store_data['location']

0       Mumbai
1      Chennai
2    Bangalore
3      Kolkata
4         Pune
Name: location, dtype: object

In [237]:
# accessing rows with the step size of 2
store_data[::2]

Unnamed: 0,CustomerID,location,gender,type,quantity,total_bill
0,CustID00,Mumbai,M,Electronics,1,600
2,CustID02,Bangalore,F,Beverages,4,150
4,CustID04,Pune,F,Beauty,1,80


In [238]:
# accessing every second row in reverse order
store_data[::-2]

Unnamed: 0,CustomerID,location,gender,type,quantity,total_bill
4,CustID04,Pune,F,Beauty,1,80
2,CustID02,Bangalore,F,Beverages,4,150
0,CustID00,Mumbai,M,Electronics,1,600


**Q6. Create a sales dataframe using a dictionary and answer the questions below.**
> Access first 4 rows.

> Print categories from data.

> Print all rows in reverse.

In [239]:
# Data for the DataFrame
data = {
    'Product': ['Shirt', 'Pants', 'Shoes', 'Hat', 'Socks'],
    'Category': ['Apparel', 'Apparel', 'Footwear', 'Accessories', 'Apparel'],
    'Price': [25.99, 40.50, 89.95, 15.99, 5.99],
    'Units Sold': [120, 85, 60, 150, 200],
    'Revenue': [25.99 * 120, 40.50 * 85, 89.95 * 60, 15.99 * 150, 5.99 * 200]}


# Creating the DataFrame
revenue_df = pd.DataFrame(data)

revenue_df

Unnamed: 0,Product,Category,Price,Units Sold,Revenue
0,Shirt,Apparel,25.99,120,3118.8
1,Pants,Apparel,40.5,85,3442.5
2,Shoes,Footwear,89.95,60,5397.0
3,Hat,Accessories,15.99,150,2398.5
4,Socks,Apparel,5.99,200,1198.0


In [240]:
# Access first 4 rows.
revenue_df[:4]

Unnamed: 0,Product,Category,Price,Units Sold,Revenue
0,Shirt,Apparel,25.99,120,3118.8
1,Pants,Apparel,40.5,85,3442.5
2,Shoes,Footwear,89.95,60,5397.0
3,Hat,Accessories,15.99,150,2398.5


In [241]:
#Print categories from data.
revenue_df['Category']

0        Apparel
1        Apparel
2       Footwear
3    Accessories
4        Apparel
Name: Category, dtype: object

In [242]:
#Print all rows in reverse.
revenue_df[::-1]

Unnamed: 0,Product,Category,Price,Units Sold,Revenue
4,Socks,Apparel,5.99,200,1198.0
3,Hat,Accessories,15.99,150,2398.5
2,Shoes,Footwear,89.95,60,5397.0
1,Pants,Apparel,40.5,85,3442.5
0,Shirt,Apparel,25.99,120,3118.8


#### **Using loc and iloc method**

**loc method**

* loc is a label-based indexer for accessing rows and columns of pandas objects. When using loc on a dataframe, we specify which rows and which columns we want by using the following format:

  * **dataframe.loc[row selection, column selection]**

* DataFrame.loc[] accepts **index labels** (a single label, a list of labels, or a label slice) as well as boolean arrays, and returns the matching rows as a Series or a dataframe. It raises a KeyError if a requested label does not exist in the data frame.

In [243]:
# accessing the row with index label 1 using loc (this is the second row, since the default index starts at 0)
store_data.loc[1]

CustomerID    CustID01
location       Chennai
gender               M
type              Food
quantity             3
total_bill         100
Name: 1, dtype: object

**Accessing selected rows and columns using loc method**

In [244]:
# accessing the rows with index labels 1 and 4, along with the location and type columns
store_data.loc[[1,4],['location','type']]

Unnamed: 0,location,type
1,Chennai,Food
4,Pune,Beauty


**iloc method**

* The iloc indexer for Pandas Dataframe is used for **integer location-based** indexing/selection by position. When using iloc on a dataframe, we specify which rows and which columns we want by using the following format:

  * **dataframe.iloc[row selection, column selection]**



In [245]:
# accessing selected rows and columns using iloc method
store_data.iloc[[1,4],[0,2]]

Unnamed: 0,CustomerID,gender
1,CustID01,M
4,CustID04,F


**Difference between loc and iloc indexing methods**

* loc is label-based, which means that you have to specify rows and columns based on their row and column labels.
* iloc is integer position-based, so you have to specify rows and columns by their integer position values (0-based integer position).
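
The difference is easiest to see when the index labels do not coincide with the positions. A minimal sketch, using a small hypothetical frame whose index labels are 10, 20 and 30 (these labels and the variable names are assumptions made for illustration):

```python
import pandas as pd

# Hypothetical frame: index labels (10, 20, 30) differ from positions (0, 1, 2)
df = pd.DataFrame({'location': ['Mumbai', 'Chennai', 'Pune'],
                   'quantity': [1, 3, 1]},
                  index=[10, 20, 30])

by_label = df.loc[20, 'location']   # row labeled 20, column labeled 'location'
by_position = df.iloc[1, 0]         # second row, first column

print(by_label, by_position)
```

Both expressions reach the same cell. In this frame `df.loc[1]` would fail with a KeyError, because no row carries the label 1.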


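If we use labels instead of integer positions in .iloc, it will throw an error. To select the location and type columns with .iloc, we pass their integer positions (1 and 3):

In [246]:
# accessing selected rows and columns using iloc method
store_data.iloc[[1,4],[1, 3]]

Unnamed: 0,location,type
1,Chennai,Food
4,Pune,Beauty


* With integer positions, .iloc returns the same rows and columns as the earlier .loc example. Passing the labels 'location' and 'type' to .iloc instead would raise an error.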
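
A minimal sketch of what happens when column labels are passed to .iloc, using a two-row stand-in for store_data. The exception type is printed rather than assumed, and `outcome` is a name introduced only for this illustration.

```python
import pandas as pd

# Small stand-in for store_data
store_data = pd.DataFrame({'location': ['Mumbai', 'Chennai'],
                           'type': ['Electronics', 'Food']})

try:
    store_data.iloc[[0, 1], ['location', 'type']]
    outcome = 'no error'
except (IndexError, TypeError, ValueError) as err:
    # .iloc only accepts integer positions, slices, or boolean arrays
    outcome = type(err).__name__

print(outcome)
```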

We can also modify entries of a dataframe using loc or iloc.

In [247]:
print(store_data.loc[4,'type'])
store_data.loc[4,'type'] = 'Electronics'

Beauty


In [248]:
store_data

Unnamed: 0,CustomerID,location,gender,type,quantity,total_bill
0,CustID00,Mumbai,M,Electronics,1,600
1,CustID01,Chennai,M,Food,3,100
2,CustID02,Bangalore,F,Beverages,4,150
3,CustID03,Kolkata,M,Medicine,2,200
4,CustID04,Pune,F,Electronics,1,80


In [249]:
store_data.iloc[4,3] = 'Beauty'
store_data

Unnamed: 0,CustomerID,location,gender,type,quantity,total_bill
0,CustID00,Mumbai,M,Electronics,1,600
1,CustID01,Chennai,M,Food,3,100
2,CustID02,Bangalore,F,Beverages,4,150
3,CustID03,Kolkata,M,Medicine,2,200
4,CustID04,Pune,F,Beauty,1,80


**Q7. Based on the revenue_df (created in Q6) Retrieve the Price and Units Sold for the product 'Shoes' using the .loc[] method.**

In [250]:
shoes_data = revenue_df.loc[revenue_df['Product'] == 'Shoes', ['Price', 'Units Sold']]

shoes_data


Unnamed: 0,Price,Units Sold
2,89.95,60


**Q8. Using the .iloc[] method, what are the Category and Revenue for the third product in the DataFrame?**

In [251]:
third_product_data = revenue_df.iloc[2, [1, 4]]

third_product_data

Category    Footwear
Revenue       5397.0
Name: 2, dtype: object

**Q9.  Find out how many units of 'Apparel' category products were sold in total.**

In [252]:
total_apparel_units = revenue_df.loc[revenue_df['Category'] == 'Apparel', 'Units Sold'].sum()

total_apparel_units


405

**Q10. Use .iloc[] to extract the data for the second and fourth products in the DataFrame, including all columns.**

In [253]:
second_fourth_products = revenue_df.iloc[[1, 3], :]
second_fourth_products

Unnamed: 0,Product,Category,Price,Units Sold,Revenue
1,Pants,Apparel,40.5,85,3442.5
3,Hat,Accessories,15.99,150,2398.5


**Condition based indexing**

In [254]:
store_data['quantity']>1

0    False
1     True
2     True
3     True
4    False
Name: quantity, dtype: bool

* Wherever the condition of greater than 1 is satisfied in quantity column, 'True' is returned. Let's retrieve the original values wherever the condition is satisfied.

In [255]:
store_data.loc[store_data['quantity']>1]

Unnamed: 0,CustomerID,location,gender,type,quantity,total_bill
1,CustID01,Chennai,M,Food,3,100
2,CustID02,Bangalore,F,Beverages,4,150
3,CustID03,Kolkata,M,Medicine,2,200


* Wherever the condition is satisfied we get the original values, and wherever the condition is not satisfied we do not get those records in the output.
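
Several conditions can be combined in the same way. The sketch below rebuilds three of the store_data columns and keeps rows where the quantity is above 1 and the bill is at least 150; the bill threshold is an arbitrary value chosen for illustration.

```python
import pandas as pd

store_data = pd.DataFrame({'gender': ['M', 'M', 'F', 'M', 'F'],
                           'quantity': [1, 3, 4, 2, 1],
                           'total_bill': [600, 100, 150, 200, 80]})

# Each condition needs its own parentheses; & means "and", | means "or"
mask = (store_data['quantity'] > 1) & (store_data['total_bill'] >= 150)
selected = store_data.loc[mask]

print(selected)
```

Python's `and` / `or` keywords do not work element-wise on Series, which is why the `&` and `|` operators are used here.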

**Column addition and removal from a Pandas DataFrame**

**Adding a new column in a DataFrame**

In [256]:
store_data

Unnamed: 0,CustomerID,location,gender,type,quantity,total_bill
0,CustID00,Mumbai,M,Electronics,1,600
1,CustID01,Chennai,M,Food,3,100
2,CustID02,Bangalore,F,Beverages,4,150
3,CustID03,Kolkata,M,Medicine,2,200
4,CustID04,Pune,F,Beauty,1,80


In [257]:
# adding a new column in data frame store_data which is a rating (out of 5) given by customer based on their shopping experience
store_data['rating'] = [2,5,3,4,4]
store_data

Unnamed: 0,CustomerID,location,gender,type,quantity,total_bill,rating
0,CustID00,Mumbai,M,Electronics,1,600,2
1,CustID01,Chennai,M,Food,3,100,5
2,CustID02,Bangalore,F,Beverages,4,150,3
3,CustID03,Kolkata,M,Medicine,2,200,4
4,CustID04,Pune,F,Beauty,1,80,4


**Removing a column from a DataFrame**

* The CustomerID column is a unique identifier of each customer. This unique identifier will not help 24/7 Stores in getting useful insights about their customers. So, they have decided to remove this column from the data frame.

In [258]:
store_data.drop('CustomerID',axis=1)

Unnamed: 0,location,gender,type,quantity,total_bill,rating
0,Mumbai,M,Electronics,1,600,2
1,Chennai,M,Food,3,100,5
2,Bangalore,F,Beverages,4,150,3
3,Kolkata,M,Medicine,2,200,4
4,Pune,F,Beauty,1,80,4


* drop() returned a dataframe without the 'CustomerID' column, but it did not modify store_data itself.
* Let's have a look at store_data again to confirm this.

In [259]:
store_data

Unnamed: 0,CustomerID,location,gender,type,quantity,total_bill,rating
0,CustID00,Mumbai,M,Electronics,1,600,2
1,CustID01,Chennai,M,Food,3,100,5
2,CustID02,Bangalore,F,Beverages,4,150,3
3,CustID03,Kolkata,M,Medicine,2,200,4
4,CustID04,Pune,F,Beauty,1,80,4


* We see that store_data still has column 'CustomerID' in it.
* To make permanent changes to a dataframe,  we will have to use the parameter `inplace` and set its value to `True`.
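
An alternative to `inplace=True` is to assign the result of drop() back to the same variable. A minimal sketch on a small hypothetical frame (the variable name `df` is an assumption for illustration):

```python
import pandas as pd

df = pd.DataFrame({'CustomerID': ['CustID00', 'CustID01'],
                   'location': ['Mumbai', 'Chennai']})

# drop() returns a new DataFrame; assigning it back replaces the old one
df = df.drop(columns='CustomerID')

print(df.columns.tolist())
```

`columns='CustomerID'` is equivalent to passing the label together with `axis=1`.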

In [260]:
store_data.drop('CustomerID',axis=1,inplace=True)
store_data

Unnamed: 0,location,gender,type,quantity,total_bill,rating
0,Mumbai,M,Electronics,1,600,2
1,Chennai,M,Food,3,100,5
2,Bangalore,F,Beverages,4,150,3
3,Kolkata,M,Medicine,2,200,4
4,Pune,F,Beauty,1,80,4


* Now the column has been permanently removed from the dataframe.


* We can also remove multiple columns simultaneously
* It is always a good idea to store the new/updated dataframes in new variables to avoid changes to the existing dataframe

In [261]:
# creating a copy of the existing data frame
new_store_data = store_data.copy()
store_data

Unnamed: 0,location,gender,type,quantity,total_bill,rating
0,Mumbai,M,Electronics,1,600,2
1,Chennai,M,Food,3,100,5
2,Bangalore,F,Beverages,4,150,3
3,Kolkata,M,Medicine,2,200,4
4,Pune,F,Beauty,1,80,4


In [262]:
# dropping location and rating columns simultaneously
# the columns to be dropped are passed as a list to the drop() function
new_store_data.drop(['location','rating'],axis=1,inplace=True)
new_store_data

Unnamed: 0,gender,type,quantity,total_bill
0,M,Electronics,1,600
1,M,Food,3,100
2,F,Beverages,4,150
3,M,Medicine,2,200
4,F,Beauty,1,80


In [263]:
# lets check if store_data was impacted
store_data

Unnamed: 0,location,gender,type,quantity,total_bill,rating
0,Mumbai,M,Electronics,1,600,2
1,Chennai,M,Food,3,100,5
2,Bangalore,F,Beverages,4,150,3
3,Kolkata,M,Medicine,2,200,4
4,Pune,F,Beauty,1,80,4


* There were no changes to data frame store_data.

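* copy() creates a deep copy by default: the new dataframe gets its own copy of the data, so changes made to it do not affect the original.
* A shallow copy (copy(deep=False)) creates a new dataframe object that refers to the same underlying data as the original.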
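
The sketch below contrasts plain assignment with a deep copy on a small hypothetical frame. Shallow copies are left out on purpose, since how writes to them propagate depends on the pandas version and its copy-on-write setting.

```python
import pandas as pd

original = pd.DataFrame({'quantity': [1, 3, 4]})

alias = original        # plain assignment: a second name for the same object
deep = original.copy()  # copy() defaults to deep=True: independent data

deep.loc[0, 'quantity'] = 99  # changes only the copy

print(alias is original)
print(original.loc[0, 'quantity'])
print(deep.loc[0, 'quantity'])
```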

**Removing rows from a dataframe**

In [264]:
store_data.drop(1,axis=0)

Unnamed: 0,location,gender,type,quantity,total_bill,rating
0,Mumbai,M,Electronics,1,600,2
2,Bangalore,F,Beverages,4,150,3
3,Kolkata,M,Medicine,2,200,4
4,Pune,F,Beauty,1,80,4


In [265]:
store_data

Unnamed: 0,location,gender,type,quantity,total_bill,rating
0,Mumbai,M,Electronics,1,600,2
1,Chennai,M,Food,3,100,5
2,Bangalore,F,Beverages,4,150,3
3,Kolkata,M,Medicine,2,200,4
4,Pune,F,Beauty,1,80,4


* Notice that we used **`axis=0`** to drop a row from a data frame, while we were using **`axis=1`** for dropping a column from the data frame.
* Also, to make permanent changes to the data frame we will have to use `inplace=True` parameter.
* In the output of drop() above, the index labels jump from 0 to 2 because the row with label 1 has been removed. To get consecutive labels again, we will have to reset the index of the data frame. Let's see how this can be done.

In [266]:
# creating a new dataframe
store_data_new  = store_data.drop(1,axis=0)
store_data_new

Unnamed: 0,location,gender,type,quantity,total_bill,rating
0,Mumbai,M,Electronics,1,600,2
2,Bangalore,F,Beverages,4,150,3
3,Kolkata,M,Medicine,2,200,4
4,Pune,F,Beauty,1,80,4


In [267]:
# resetting the index of data frame
store_data_new.reset_index()

Unnamed: 0,index,location,gender,type,quantity,total_bill,rating
0,0,Mumbai,M,Electronics,1,600,2
1,2,Bangalore,F,Beverages,4,150,3
2,3,Kolkata,M,Medicine,2,200,4
3,4,Pune,F,Beauty,1,80,4


* We see that the index of the data frame has now been reset, but the index has become a column in the data frame.
* We do not need the index to become a column so we can simply set the parameter **`drop=True`** in reset_index() function.
* We will also be setting `inplace` to `True` to make the changes permanent

In [268]:
store_data_new.reset_index(drop=True,inplace=True)
store_data_new

Unnamed: 0,location,gender,type,quantity,total_bill,rating
0,Mumbai,M,Electronics,1,600,2
1,Bangalore,F,Beverages,4,150,3
2,Kolkata,M,Medicine,2,200,4
3,Pune,F,Beauty,1,80,4


**Q11. Based on the revenue_df (created in Q6), add a new column named 'Profit Margin' to revenue_df, assuming a constant profit margin of 20% on all products.**

In [269]:
revenue_df['Profit Margin'] = revenue_df['Revenue'] * 0.20
revenue_df

Unnamed: 0,Product,Category,Price,Units Sold,Revenue,Profit Margin
0,Shirt,Apparel,25.99,120,3118.8,623.76
1,Pants,Apparel,40.5,85,3442.5,688.5
2,Shoes,Footwear,89.95,60,5397.0,1079.4
3,Hat,Accessories,15.99,150,2398.5,479.7
4,Socks,Apparel,5.99,200,1198.0,239.6


**Q12. How would you remove the 'Units Sold' column from revenue_df?**

In [270]:
revenue_df.drop('Units Sold', axis=1, inplace=True)
revenue_df

Unnamed: 0,Product,Category,Price,Revenue,Profit Margin
0,Shirt,Apparel,25.99,3118.8,623.76
1,Pants,Apparel,40.5,3442.5,688.5
2,Shoes,Footwear,89.95,5397.0,1079.4
3,Hat,Accessories,15.99,2398.5,479.7
4,Socks,Apparel,5.99,1198.0,239.6


**Q13. Create a copy of revenue_df named backup_df.**


In [271]:
backup_df = revenue_df.copy()
backup_df

Unnamed: 0,Product,Category,Price,Revenue,Profit Margin
0,Shirt,Apparel,25.99,3118.8,623.76
1,Pants,Apparel,40.5,3442.5,688.5
2,Shoes,Footwear,89.95,5397.0,1079.4
3,Hat,Accessories,15.99,2398.5,479.7
4,Socks,Apparel,5.99,1198.0,239.6


**Q15. How would you remove the first two rows from revenue_df?**


In [272]:
revenue_df.drop(revenue_df.index[:2], inplace=True)
revenue_df

Unnamed: 0,Product,Category,Price,Revenue,Profit Margin
2,Shoes,Footwear,89.95,5397.0,1079.4
3,Hat,Accessories,15.99,2398.5,479.7
4,Socks,Apparel,5.99,1198.0,239.6


**Q16. After removing rows, how can you reset the index of revenue_df?**


In [273]:
revenue_df.reset_index(drop=True, inplace=True)
revenue_df

Unnamed: 0,Product,Category,Price,Revenue,Profit Margin
0,Shoes,Footwear,89.95,5397.0,1079.4
1,Hat,Accessories,15.99,2398.5,479.7
2,Socks,Apparel,5.99,1198.0,239.6


### Pandas - Combining DataFrames

We will examine 3 methods for combining dataframes:

1. concat
2. join
3. merge

In [274]:
data_cust = pd.DataFrame({"customerID":['101','102','103','104'],
                        'category': ['Medium','Medium','High','Low'],
                        'first_visit': ['yes','no','yes','yes'],
                        'sales': [123,52,214,663]},index=[0,1,2,3])

data_cust_new = pd.DataFrame({"customerID":['101','103','104','105'],
                    'distance': [12,9,44,21],
                    'sales': [123,214,663,331]},index=[4,5,6,7])

In [275]:
data_cust

Unnamed: 0,customerID,category,first_visit,sales
0,101,Medium,yes,123
1,102,Medium,no,52
2,103,High,yes,214
3,104,Low,yes,663


In [276]:
data_cust_new

Unnamed: 0,customerID,distance,sales
4,101,12,123
5,103,9,214
6,104,44,663
7,105,21,331


**Concat**

* ***concat*** concatenates dataframes along a particular axis

In [277]:
pd.concat([data_cust,data_cust_new],axis=0)

Unnamed: 0,customerID,category,first_visit,sales,distance
0,101,Medium,yes,123,
1,102,Medium,no,52,
2,103,High,yes,214,
3,104,Low,yes,663,
4,101,,,123,12.0
5,103,,,214,9.0
6,104,,,663,44.0
7,105,,,331,21.0


In [278]:
pd.concat([data_cust,data_cust_new],axis=1)

Unnamed: 0,customerID,category,first_visit,sales,customerID.1,distance,sales.1
0,101.0,Medium,yes,123.0,,,
1,102.0,Medium,no,52.0,,,
2,103.0,High,yes,214.0,,,
3,104.0,Low,yes,663.0,,,
4,,,,,101.0,12.0,123.0
5,,,,,103.0,9.0,214.0
6,,,,,104.0,44.0,663.0
7,,,,,105.0,21.0,331.0
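
With axis=0, concat keeps each dataframe's original index labels. data_cust and data_cust_new were given non-overlapping labels (0 to 3 and 4 to 7), but frames that both use the default index would end up with duplicate labels. A minimal sketch, using two small hypothetical frames, of how `ignore_index=True` renumbers the result:

```python
import pandas as pd

first = pd.DataFrame({'customerID': ['101', '102'], 'sales': [123, 52]})
second = pd.DataFrame({'customerID': ['103', '104'], 'sales': [214, 663]})

# Original labels are kept, so 0 and 1 each appear twice
stacked = pd.concat([first, second], axis=0)

# ignore_index=True discards the old labels and numbers the rows 0..n-1
renumbered = pd.concat([first, second], axis=0, ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```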


**Merge and Join**

* ***merge*** combines dataframes using a column's values to identify common entries

* ***join*** combines dataframes using the index to identify common entries

In [279]:
pd.merge(data_cust,data_cust_new,how='outer',on='customerID') # outer merge keeps the union of customerID values from both dataframes

Unnamed: 0,customerID,category,first_visit,sales_x,distance,sales_y
0,101,Medium,yes,123.0,12.0,123.0
1,102,Medium,no,52.0,,
2,103,High,yes,214.0,9.0,214.0
3,104,Low,yes,663.0,44.0,663.0
4,105,,,,21.0,331.0


In [280]:
pd.merge(data_cust,data_cust_new,how='inner',on='customerID') # inner merge keeps only the customerID values present in both dataframes

Unnamed: 0,customerID,category,first_visit,sales_x,distance,sales_y
0,101,Medium,yes,123,12,123
1,103,High,yes,214,9,214
2,104,Low,yes,663,44,663


In [281]:
pd.merge(data_cust,data_cust_new,how='right',on='customerID')

Unnamed: 0,customerID,category,first_visit,sales_x,distance,sales_y
0,101,Medium,yes,123.0,12,123
1,103,High,yes,214.0,9,214
2,104,Low,yes,663.0,44,663
3,105,,,,21,331
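
The remaining option, `how='left'`, keeps every row of the first dataframe. When it is unclear where each row of a merge came from, the `indicator=True` parameter of pd.merge adds a `_merge` column that records it. A minimal sketch on trimmed-down versions of the two customer dataframes (the `sales` column is left out here to keep the output short):

```python
import pandas as pd

data_cust = pd.DataFrame({'customerID': ['101', '102', '103', '104'],
                          'category': ['Medium', 'Medium', 'High', 'Low']})
data_cust_new = pd.DataFrame({'customerID': ['101', '103', '104', '105'],
                              'distance': [12, 9, 44, 21]})

# left merge: all customers from data_cust, distance is NaN where there is no match
left = pd.merge(data_cust, data_cust_new, how='left', on='customerID')

# outer merge with an extra column showing the origin of each row
audit = pd.merge(data_cust, data_cust_new, how='outer', on='customerID', indicator=True)

print(left)
print(audit)
```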


In [282]:
data_quarters = pd.DataFrame({'Q1': [101,102,103],
                              'Q2': [201,202,203]},
                               index=['I0','I1','I2'])

data_quarters_new = pd.DataFrame({'Q3': [301,302,303],
                                  'Q4': [401,402,403]},
                               index=['I0','I2','I3'])

In [283]:
data_quarters

Unnamed: 0,Q1,Q2
I0,101,201
I1,102,202
I2,103,203


In [284]:
data_quarters_new

Unnamed: 0,Q3,Q4
I0,301,401
I2,302,402
I3,303,403


* `join` behaves just like merge,  except instead of using the values of one of the columns to combine data frames, it uses the index labels

In [285]:
data_quarters.join(data_quarters_new,how='right') # outer, inner, left, and right work the same as merge

Unnamed: 0,Q1,Q2,Q3,Q4
I0,101.0,201.0,301,401
I2,103.0,203.0,302,402
I3,,,303,403
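
To make the relationship between join and merge concrete, the sketch below performs the same inner combination twice: once with join, and once with pd.merge told to match on the index of both dataframes through `left_index=True` and `right_index=True`.

```python
import pandas as pd

data_quarters = pd.DataFrame({'Q1': [101, 102, 103], 'Q2': [201, 202, 203]},
                             index=['I0', 'I1', 'I2'])
data_quarters_new = pd.DataFrame({'Q3': [301, 302, 303], 'Q4': [401, 402, 403]},
                                 index=['I0', 'I2', 'I3'])

# join matches rows on the index labels
joined = data_quarters.join(data_quarters_new, how='inner')

# merge can do the same when asked to use the index on both sides
merged = pd.merge(data_quarters, data_quarters_new, how='inner',
                  left_index=True, right_index=True)

print(joined)
print(merged)
```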


Let's assume we have the following additional DataFrames related to revenue_df:

* df_suppliers - Contains information about suppliers for each product.
* df_new_stock - Contains information about new stock received for each product.
* df_sales_region - Contains sales data broken down by region.

In [286]:
# df_suppliers
df_suppliers = pd.DataFrame({
    'Product': ['Shirt', 'Pants', 'Shoes', 'Hat', 'Socks'],
    'Supplier': ['Supplier A', 'Supplier B', 'Supplier C', 'Supplier D', 'Supplier E']
})

# df_new_stock
df_new_stock = pd.DataFrame({
    'Product': ['Shirt', 'Pants', 'Shoes', 'Hat', 'Socks'],
    'New Stock': [300, 200, 150, 400, 500]
})

# df_sales_region
df_sales_region = pd.DataFrame({
    'Product': ['Shirt', 'Pants', 'Shoes', 'Hat', 'Socks'],
    'Region A': [40, 30, 20, 50, 60],
    'Region B': [80, 55, 40, 100, 140]
})

**Q17. Perform an inner join between revenue_df and df_suppliers on the 'Product' column to combine sales data with supplier information.**

In [287]:
# join() defaults to a left join, so how='inner' is passed explicitly
joined_df = revenue_df.join(df_suppliers.set_index('Product'), on='Product', how='inner')
joined_df

Unnamed: 0,Product,Category,Price,Revenue,Profit Margin,Supplier
0,Shoes,Footwear,89.95,5397.0,1079.4,Supplier C
1,Hat,Accessories,15.99,2398.5,479.7,Supplier D
2,Socks,Apparel,5.99,1198.0,239.6,Supplier E


**Q18. Concatenate revenue_df and df_new_stock to extend the data vertically. Columns that exist in only one of the dataframes are filled with missing values.**

In [288]:
concatenated_df = pd.concat([revenue_df, df_new_stock])
concatenated_df

Unnamed: 0,Product,Category,Price,Revenue,Profit Margin,New Stock
0,Shoes,Footwear,89.95,5397.0,1079.4,
1,Hat,Accessories,15.99,2398.5,479.7,
2,Socks,Apparel,5.99,1198.0,239.6,
0,Shirt,,,,,300.0
1,Pants,,,,,200.0
2,Shoes,,,,,150.0
3,Hat,,,,,400.0
4,Socks,,,,,500.0


**Q19. Merge revenue_df with df_sales_region on the 'Product' column to get a combined DataFrame with regional sales data.**

In [289]:
merged_df = pd.merge(revenue_df, df_sales_region, on='Product')
merged_df

Unnamed: 0,Product,Category,Price,Revenue,Profit Margin,Region A,Region B
0,Shoes,Footwear,89.95,5397.0,1079.4,20,40
1,Hat,Accessories,15.99,2398.5,479.7,50,100
2,Socks,Apparel,5.99,1198.0,239.6,60,140


### Pandas - Saving and Loading DataFrames

**Note**

In real-life scenarios, we deal with much larger datasets that have thousands of rows and many columns. It is not feasible to create such datasets by hand using lists, especially as the number of rows and columns grows.

So we need a more efficient way of getting data into Python. We can import datasets from our local system, from links, or from databases and work on them directly instead of creating our own dataset.

**Loading a CSV file in Python**

**For Jupyter Notebook**
* When the data file and jupyter notebook are in the same folder.

In [290]:
# pd.read_csv() works with just the file name if the notebook and the dataset are in the same folder

# data = pd.read_csv('superkart.csv')

**For Google Colab with Google Drive**

First, we have to give google colab access to our google drive:

In [291]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Once we have access we can load files from google drive using read_csv() function.

In [292]:
path="/content/drive/MyDrive/Datasets/SuperKart.csv"
data=pd.read_csv(path)

In [293]:
# head() function helps us to see the first 5 rows of the data
data.head()

Unnamed: 0,Product_Id,Product_Weight,Product_Sugar_Content,Product_Allocated_Area,Product_Type,Product_MRP,Store_Id,Store_Establishment_Year,Store_Size,Store_Location_City_Type,Store_Type,Product_Store_Sales_Total,Date
0,NC7411,9.0,No Sugar,0.03,Health and Hygiene,31.0,OUT002,1998.0,Small,Tier 3,Food Mart,253.53,03/02/2020 6:23:52
1,FD5378,7.64,Low Sugar,0.019,Meat,41.84,OUT002,1998.0,Small,Tier 3,Food Mart,166.92,03/02/2020 6:23:52
2,FD735,6.95,Low Sugar,0.079,Canned,50.13,OUT002,1998.0,Small,Tier 3,Food Mart,180.86,03/02/2020 6:23:52
3,FD4245,7.02,Regular,0.029,Baking Goods,50.42,OUT002,1998.0,Small,Tier 3,Food Mart,203.55,03/02/2020 6:23:52
4,FD6089,9.69,Regular,0.027,Snack Foods,51.9,OUT002,1998.0,Small,Tier 3,Food Mart,836.23,03/02/2020 6:23:52


**Loading an excel file in Python**

In [294]:
path_excel="/content/drive/MyDrive/Datasets/SuperKart.xlsx"
data_excel = pd.read_excel(path_excel)

In [295]:
data_excel.head()

Unnamed: 0,Product_Id,Product_Weight,Product_Sugar_Content,Product_Allocated_Area,Product_Type,Product_MRP,Store_Id,Store_Establishment_Year,Store_Size,Store_Location_City_Type,Store_Type,Product_Store_Sales_Total,Date
0,NC7411,9.0,No Sugar,0.03,Health and Hygiene,31.0,OUT002,1998.0,Small,Tier 3,Food Mart,253.53,2020-03-02 06:23:52
1,FD5378,7.64,Low Sugar,0.019,Meat,41.84,OUT002,1998.0,Small,Tier 3,Food Mart,166.92,2020-03-02 06:23:52
2,FD735,6.95,Low Sugar,0.079,Canned,50.13,OUT002,1998.0,Small,Tier 3,Food Mart,180.86,2020-03-02 06:23:52
3,FD4245,7.02,Regular,0.029,Baking Goods,50.42,OUT002,1998.0,Small,Tier 3,Food Mart,203.55,2020-03-02 06:23:52
4,FD6089,9.69,Regular,0.027,Snack Foods,51.9,OUT002,1998.0,Small,Tier 3,Food Mart,836.23,2020-03-02 06:23:52


**Saving a dataset in Python**

**Saving the dataset as a CSV file**

To save a dataset as .csv file the syntax used is -

```data.to_csv('name of the file.csv', index=False)```

In [296]:
data.to_csv('/content/drive/MyDrive/Datasets/Saved_StockData.csv',index=False)

* In Jupyter Notebook, if only a file name is given, the dataset will be saved in the folder where the notebook is located.
* We can also save the dataset to a desired folder by providing the path/location of the folder.
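
The effect of `index=False` can be checked with a small round trip. The sketch below writes a hypothetical two-row frame to a temporary folder (the file names are made up for this illustration) and reads it back, once without and once with the index.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'Product': ['Shirt', 'Pants'], 'Price': [25.99, 40.50]})

with tempfile.TemporaryDirectory() as folder:
    # index=False: the row labels are not written to the file
    path = os.path.join(folder, 'example.csv')
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

    # default index=True: the row labels become an extra, unnamed first column
    path_with_index = os.path.join(folder, 'example_with_index.csv')
    df.to_csv(path_with_index)
    reloaded_with_index = pd.read_csv(path_with_index)

print(reloaded.columns.tolist())
print(reloaded_with_index.columns.tolist())
```

The second read shows an additional `Unnamed: 0` column, which is the saved index coming back as ordinary data.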

**Saving the dataset as an Excel spreadsheet**

To save a dataset as .xlsx file the syntax used is -

```data.to_excel('name of the file.xlsx',index=False)```

In [297]:
data.to_excel('/content/drive/MyDrive/Datasets/Saved_StockData.xlsx',index=False)

### Pandas - Functions

**head() - to check the first 5 rows of the dataset**

In [298]:
data.head()

Unnamed: 0,Product_Id,Product_Weight,Product_Sugar_Content,Product_Allocated_Area,Product_Type,Product_MRP,Store_Id,Store_Establishment_Year,Store_Size,Store_Location_City_Type,Store_Type,Product_Store_Sales_Total,Date
0,NC7411,9.0,No Sugar,0.03,Health and Hygiene,31.0,OUT002,1998.0,Small,Tier 3,Food Mart,253.53,03/02/2020 6:23:52
1,FD5378,7.64,Low Sugar,0.019,Meat,41.84,OUT002,1998.0,Small,Tier 3,Food Mart,166.92,03/02/2020 6:23:52
2,FD735,6.95,Low Sugar,0.079,Canned,50.13,OUT002,1998.0,Small,Tier 3,Food Mart,180.86,03/02/2020 6:23:52
3,FD4245,7.02,Regular,0.029,Baking Goods,50.42,OUT002,1998.0,Small,Tier 3,Food Mart,203.55,03/02/2020 6:23:52
4,FD6089,9.69,Regular,0.027,Snack Foods,51.9,OUT002,1998.0,Small,Tier 3,Food Mart,836.23,03/02/2020 6:23:52


**tail() - to check the last 5 rows of the dataset**

In [299]:
data.tail()

Unnamed: 0,Product_Id,Product_Weight,Product_Sugar_Content,Product_Allocated_Area,Product_Type,Product_MRP,Store_Id,Store_Establishment_Year,Store_Size,Store_Location_City_Type,Store_Type,Product_Store_Sales_Total,Date
8758,FD3986,,Low Sugar,0.01,Fruits and Vegetables,,OUT002,1998.0,Small,Tier 3,Food Mart,,03/02/2020 6:23:52
8759,FD3637,12.89,Low Sugar,0.086,Fruits and Vegetables,,OUT004,2009.0,Medium,Tier 2,Supermarket Type2,,03/02/2020 6:23:52
8760,FD4203,,Regular,0.074,Fruits and Vegetables,,OUT004,2009.0,Medium,Tier 2,Supermarket Type2,,03/02/2020 6:23:52
8761,NC4330,,No Sugar,0.161,Health and Hygiene,,OUT004,2009.0,Medium,Tier 2,Supermarket Type2,,03/02/2020 6:23:52
8762,FD6791,,Low Sugar,0.03,Snack Foods,,OUT004,2009.0,Medium,Tier 2,Supermarket Type2,,03/02/2020 6:23:52


**shape - to check the number of rows and columns in the dataset**

In [300]:
data.shape

(8763, 13)

* The dataset has 8763 rows and 13 columns.

**info() - to check the data type of the columns**

In [301]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8763 entries, 0 to 8762
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Product_Id                 8763 non-null   object 
 1   Product_Weight             8736 non-null   float64
 2   Product_Sugar_Content      8763 non-null   object 
 3   Product_Allocated_Area     8763 non-null   float64
 4   Product_Type               8763 non-null   object 
 5   Product_MRP                8736 non-null   float64
 6   Store_Id                   8763 non-null   object 
 7   Store_Establishment_Year   8755 non-null   float64
 8   Store_Size                 8763 non-null   object 
 9   Store_Location_City_Type   8763 non-null   object 
 10  Store_Type                 8763 non-null   object 
 11  Product_Store_Sales_Total  8736 non-null   float64
 12  Date                       8763 non-null   object 
dtypes: float64(5), object(8)
memory usage: 890.1+ KB

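* Product_Weight, Product_Allocated_Area, Product_MRP, Store_Establishment_Year and Product_Store_Sales_Total are numeric (float64), while the remaining eight columns, including Date, are of object type.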

**unique() - to check the unique values that are present in a column**

In [302]:
data['Product_Type'].unique()

array(['Health and Hygiene', 'Meat', 'Canned', 'Baking Goods',
       'Snack Foods', 'Dairy', 'Fruits and Vegetables', 'Soft Drinks',
       'Seafood', 'Household', 'Others', 'Frozen Foods', 'Hard Drinks',
       'Starchy Foods', 'Breads', 'Breakfast'], dtype=object)
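
unique() returns the distinct values themselves. When only their number is needed, pandas also provides nunique(). A minimal sketch on a short hypothetical Series with one missing entry:

```python
import pandas as pd

product_types = pd.Series(['Meat', 'Dairy', 'Meat', None])

values = product_types.unique()                    # distinct values, missing value included
n_distinct = product_types.nunique()               # number of distinct values, missing values ignored
n_with_missing = product_types.nunique(dropna=False)  # missing value counted as well

print(values)
print(n_distinct, n_with_missing)
```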

**Q20. Check the unique years in Store_Establishment_Year**

In [303]:
data['Store_Establishment_Year'].unique()

array([1998.,   nan, 1987., 2009., 1999.])

**value_counts() - to check how many times each unique value occurs in a column**

In [304]:
data['Product_Type'].value_counts()

Product_Type
Fruits and Vegetables    1249
Snack Foods              1149
Frozen Foods              811
Dairy                     796
Household                 740
Baking Goods              716
Canned                    677
Health and Hygiene        628
Meat                      618
Soft Drinks               519
Breads                    200
Hard Drinks               186
Others                    151
Starchy Foods             141
Breakfast                 106
Seafood                    76
Name: count, dtype: int64

**value_counts(normalize=True) - setting the `normalize` parameter to True returns the relative frequencies of the unique values.**

In [305]:
data['Product_Type'].value_counts(normalize=True)

Product_Type
Fruits and Vegetables    0.142531
Snack Foods              0.131119
Frozen Foods             0.092548
Dairy                    0.090836
Household                0.084446
Baking Goods             0.081707
Canned                   0.077257
Health and Hygiene       0.071665
Meat                     0.070524
Soft Drinks              0.059226
Breads                   0.022823
Hard Drinks              0.021226
Others                   0.017232
Starchy Foods            0.016090
Breakfast                0.012096
Seafood                  0.008673
Name: proportion, dtype: float64

**Q21. Get the number of records for each establishment year**

In [306]:
data['Store_Establishment_Year'].value_counts()

Store_Establishment_Year
2009.0    4676
1987.0    1585
1999.0    1349
1998.0    1145
Name: count, dtype: int64
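
unique() listed nan among the establishment years, but value_counts() does not show it, because missing values are excluded by default. A minimal sketch on a short hypothetical Series showing how `dropna=False` includes them:

```python
import numpy as np
import pandas as pd

years = pd.Series([2009.0, 2009.0, 1987.0, np.nan])

counts = years.value_counts()                       # NaN is skipped
counts_with_nan = years.value_counts(dropna=False)  # NaN gets its own row

print(counts)
print(counts_with_nan)
```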

**Statistical Functions**

**min() - to check the minimum value of a numeric column**

In [307]:
data['Product_MRP'].min()

1.0

**max() - to check the maximum value of a numeric column**

In [308]:
data['Product_MRP'].max()

700.0

**mean() - to check the mean (average) value of the column**

In [309]:
data['Product_MRP'].mean()

147.1906730769231

**median() - to check the median value of the column**

In [310]:
data['Product_MRP'].median()

146.785

**mode() - to check the mode value of the column**

In [311]:
data['Product_MRP'].mode()

0    160.78
Name: Product_MRP, dtype: float64

**To access a particular mode when the dataset has more than 1 mode**

In [312]:
#to access the first mode
data['Product_MRP'].mode()[0]

160.78
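
mode() returns a Series because a column can have several values tied for the highest frequency. A minimal sketch with a hypothetical price list in which 20 and 30 both occur twice:

```python
import pandas as pd

prices = pd.Series([10, 20, 20, 30, 30])

modes = prices.mode()        # all tied modes, in sorted order
first_mode = modes.iloc[0]   # first mode
second_mode = modes.iloc[1]  # second mode

print(modes.tolist(), first_mode, second_mode)
```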

**Group By function**
* Pandas dataframe.groupby() function is used to split the data into groups based on some criteria.

In [313]:
data.groupby(['Product_Type'])['Product_MRP'].mean()

Product_Type
Baking Goods             146.884252
Breads                   148.878350
Breakfast                144.318491
Canned                   145.459320
Dairy                    148.691283
Frozen Foods             146.650173
Fruits and Vegetables    146.217090
Hard Drinks              144.897243
Health and Hygiene       146.793547
Household                147.402165
Meat                     147.132532
Others                   152.727000
Seafood                  148.177895
Snack Foods              147.865053
Soft Drinks              146.765703
Starchy Foods            153.630213
Name: Product_MRP, dtype: float64

* Here the groupby function splits the data into the 16 product types present in the dataset, and then the mean MRP of each product type is calculated.

In [314]:
# similarly we can get the median MRP of each product type
data.groupby(['Product_Type'])['Product_MRP'].median()

Product_Type
Baking Goods             147.020
Breads                   146.965
Breakfast                144.415
Canned                   145.165
Dairy                    148.410
Frozen Foods             146.340
Fruits and Vegetables    146.020
Hard Drinks              144.820
Health and Hygiene       146.780
Household                146.430
Meat                     146.820
Others                   148.250
Seafood                  147.245
Snack Foods              147.550
Soft Drinks              147.630
Starchy Foods            152.560
Name: Product_MRP, dtype: float64

* Here the groupby function splits the data into the 16 product types present in the dataset, and then the median MRP of each product type is calculated.
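
The mean and median per group can also be computed in one pass. A minimal sketch using agg() on a grouped column; the small DataFrame `df` is hypothetical and only reuses two column names from the notebook:

```python
import pandas as pd

# hypothetical mini dataset reusing two column names from the notebook
df = pd.DataFrame({
    'Product_Type': ['Dairy', 'Dairy', 'Meat', 'Meat', 'Meat'],
    'Product_MRP': [100.0, 200.0, 50.0, 60.0, 70.0],
})

# one row per product type, one column per statistic
stats = df.groupby('Product_Type')['Product_MRP'].agg(['mean', 'median', 'count'])
print(stats)
```

Including 'count' alongside the averages shows how many rows each group's statistic is based on.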

**Let's create a function to increase the price of a product by 10%**

In [315]:
def profit(s):
    return s + s*0.10 # increase of 10%

**The pandas apply() function lets you manipulate the columns and rows of a DataFrame. When called on a single column, it applies the function to each value.**

In [316]:
data['Product_MRP'].apply(profit)

0       34.100
1       46.024
2       55.143
3       55.462
4       57.090
         ...  
8758       NaN
8759       NaN
8760       NaN
8761       NaN
8762       NaN
Name: Product_MRP, Length: 8763, dtype: float64
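
The NaN values at the end of the output come from rows where Product_MRP is missing; arithmetic on NaN gives NaN. For simple arithmetic like this, the same result can be obtained without apply(). A minimal sketch comparing the two approaches on a hypothetical Series (`mrp`):

```python
import numpy as np
import pandas as pd

def profit(s):
    return s + s * 0.10  # increase of 10%

# hypothetical prices with one missing value
mrp = pd.Series([31.0, 41.84, np.nan])

via_apply = mrp.apply(profit)  # calls profit() once per element
vectorised = mrp * 1.10        # one arithmetic operation on the whole column

# both approaches agree, and the missing value stays missing
print(np.allclose(via_apply, vectorised, equal_nan=True))
```

The vectorised form is usually faster on large columns, while apply() remains useful for logic that cannot be written as column arithmetic.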

* We can now add these updated values to the dataset as a new column.

In [317]:
data['new_price'] = data['Product_MRP'].apply(profit)
data.head()

Unnamed: 0,Product_Id,Product_Weight,Product_Sugar_Content,Product_Allocated_Area,Product_Type,Product_MRP,Store_Id,Store_Establishment_Year,Store_Size,Store_Location_City_Type,Store_Type,Product_Store_Sales_Total,Date,new_price
0,NC7411,9.0,No Sugar,0.03,Health and Hygiene,31.0,OUT002,1998.0,Small,Tier 3,Food Mart,253.53,03/02/2020 6:23:52,34.1
1,FD5378,7.64,Low Sugar,0.019,Meat,41.84,OUT002,1998.0,Small,Tier 3,Food Mart,166.92,03/02/2020 6:23:52,46.024
2,FD735,6.95,Low Sugar,0.079,Canned,50.13,OUT002,1998.0,Small,Tier 3,Food Mart,180.86,03/02/2020 6:23:52,55.143
3,FD4245,7.02,Regular,0.029,Baking Goods,50.42,OUT002,1998.0,Small,Tier 3,Food Mart,203.55,03/02/2020 6:23:52,55.462
4,FD6089,9.69,Regular,0.027,Snack Foods,51.9,OUT002,1998.0,Small,Tier 3,Food Mart,836.23,03/02/2020 6:23:52,57.09


**The pandas sort_values() function sorts a DataFrame in ascending or descending order of the passed column.**

In [318]:
data.sort_values(by='new_price',ascending=False) # by default ascending is set to True

Unnamed: 0,Product_Id,Product_Weight,Product_Sugar_Content,Product_Allocated_Area,Product_Type,Product_MRP,Store_Id,Store_Establishment_Year,Store_Size,Store_Location_City_Type,Store_Type,Product_Store_Sales_Total,Date,new_price
7854,FD6198,16.09,Low Sugar,0.015,Snack Foods,700.0,OUT003,1999.0,Medium,Tier 1,Departmental Store,5037.97,03/02/2020 6:23:52,770.0
6986,FD3645,12.90,Regular,0.025,Starchy Foods,500.0,OUT001,1987.0,High,Tier 2,Supermarket Type1,4036.91,03/02/2020 6:23:52,550.0
6890,NC5258,12.13,No Sugar,0.067,Health and Hygiene,350.0,OUT004,2009.0,Medium,Tier 2,Supermarket Type2,3840.72,03/02/2020 6:23:52,385.0
14,NC730,6.96,No Sugar,0.153,Others,300.0,OUT002,1998.0,Small,Tier 3,Food Mart,369.31,03/02/2020 6:23:52,330.0
8735,NC7325,22.00,No Sugar,0.060,Household,266.0,OUT003,1999.0,Medium,Tier 1,Departmental Store,8000.00,03/02/2020 6:23:52,292.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8758,FD3986,,Low Sugar,0.010,Fruits and Vegetables,,OUT002,1998.0,Small,Tier 3,Food Mart,,03/02/2020 6:23:52,
8759,FD3637,12.89,Low Sugar,0.086,Fruits and Vegetables,,OUT004,2009.0,Medium,Tier 2,Supermarket Type2,,03/02/2020 6:23:52,
8760,FD4203,,Regular,0.074,Fruits and Vegetables,,OUT004,2009.0,Medium,Tier 2,Supermarket Type2,,03/02/2020 6:23:52,
8761,NC4330,,No Sugar,0.161,Health and Hygiene,,OUT004,2009.0,Medium,Tier 2,Supermarket Type2,,03/02/2020 6:23:52,
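
In the output above, the rows with a missing new_price end up at the bottom even though the sort is descending, because sort_values() places NaN last by default. A minimal sketch of the na_position parameter and of sorting by more than one column, using a hypothetical DataFrame (`df`):

```python
import numpy as np
import pandas as pd

# hypothetical mini dataset
df = pd.DataFrame({
    'Store_Id': ['OUT002', 'OUT001', 'OUT002', 'OUT001'],
    'new_price': [34.1, np.nan, 55.1, 46.0],
})

# na_position='first' moves the missing prices to the top
nan_first = df.sort_values(by='new_price', ascending=False, na_position='first')

# sort by store (ascending), then by price (descending) within each store
by_store = df.sort_values(by=['Store_Id', 'new_price'], ascending=[True, False])
print(by_store)
```

When a list of columns is passed to by, ascending can also be a list, with one flag per column.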


### Pandas - Date-time Functions

In [319]:
# reading the StockData
path="/content/drive/MyDrive/Datasets/Saved_StockData.csv"
data=pd.read_csv(path)

In [320]:
# checking the first 5 rows of the dataset
data.head()

Unnamed: 0,Product_Id,Product_Weight,Product_Sugar_Content,Product_Allocated_Area,Product_Type,Product_MRP,Store_Id,Store_Establishment_Year,Store_Size,Store_Location_City_Type,Store_Type,Product_Store_Sales_Total,Date
0,NC7411,9.0,No Sugar,0.03,Health and Hygiene,31.0,OUT002,1998.0,Small,Tier 3,Food Mart,253.53,03/02/2020 6:23:52
1,FD5378,7.64,Low Sugar,0.019,Meat,41.84,OUT002,1998.0,Small,Tier 3,Food Mart,166.92,03/02/2020 6:23:52
2,FD735,6.95,Low Sugar,0.079,Canned,50.13,OUT002,1998.0,Small,Tier 3,Food Mart,180.86,03/02/2020 6:23:52
3,FD4245,7.02,Regular,0.029,Baking Goods,50.42,OUT002,1998.0,Small,Tier 3,Food Mart,203.55,03/02/2020 6:23:52
4,FD6089,9.69,Regular,0.027,Snack Foods,51.9,OUT002,1998.0,Small,Tier 3,Food Mart,836.23,03/02/2020 6:23:52


In [321]:
# checking the data type of columns in the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8763 entries, 0 to 8762
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Product_Id                 8763 non-null   object 
 1   Product_Weight             8736 non-null   float64
 2   Product_Sugar_Content      8763 non-null   object 
 3   Product_Allocated_Area     8763 non-null   float64
 4   Product_Type               8763 non-null   object 
 5   Product_MRP                8736 non-null   float64
 6   Store_Id                   8763 non-null   object 
 7   Store_Establishment_Year   8755 non-null   float64
 8   Store_Size                 8763 non-null   object 
 9   Store_Location_City_Type   8763 non-null   object 
 10  Store_Type                 8763 non-null   object 
 11  Product_Store_Sales_Total  8736 non-null   float64
 12  Date                       8763 non-null   object 
dtypes: float64(5), object(8)
memory usage: 890.1+ KB

* We observe that the 'Date' column is of object type, whereas it should be of datetime type.

In [322]:
# converting the Date column to datetime format
# dayfirst=True reads 03/02/2020 as 3 February 2020 rather than 2 March 2020
data['Date'] = pd.to_datetime(data['Date'], dayfirst=True)

In [323]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8763 entries, 0 to 8762
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   Product_Id                 8763 non-null   object        
 1   Product_Weight             8736 non-null   float64       
 2   Product_Sugar_Content      8763 non-null   object        
 3   Product_Allocated_Area     8763 non-null   float64       
 4   Product_Type               8763 non-null   object        
 5   Product_MRP                8736 non-null   float64       
 6   Store_Id                   8763 non-null   object        
 7   Store_Establishment_Year   8755 non-null   float64       
 8   Store_Size                 8763 non-null   object        
 9   Store_Location_City_Type   8763 non-null   object        
 10  Store_Type                 8763 non-null   object        
 11  Product_Store_Sales_Total  8736 non-null   float64       
 12  Date                       8763 non-null   datetime64[ns]

* We observe that the date column has been converted to datetime format
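
When the layout of the date strings is known, it can be stated explicitly instead of relying on the dayfirst hint. A minimal sketch, assuming the strings follow a day/month/year hour:minute:second layout as the conversion above implies; the Series `raw` is hypothetical:

```python
import pandas as pd

# hypothetical strings in the same layout as the Date column
raw = pd.Series(['03/02/2020 6:23:52', '25/12/2020 18:05:00'])

# explicit format: day/month/year followed by hour:minute:second
parsed = pd.to_datetime(raw, format='%d/%m/%Y %H:%M:%S')
print(parsed)
```

With an explicit format, a string that does not match the layout raises an error instead of being parsed in an unexpected way.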

In [324]:
data.head()

Unnamed: 0,Product_Id,Product_Weight,Product_Sugar_Content,Product_Allocated_Area,Product_Type,Product_MRP,Store_Id,Store_Establishment_Year,Store_Size,Store_Location_City_Type,Store_Type,Product_Store_Sales_Total,Date
0,NC7411,9.0,No Sugar,0.03,Health and Hygiene,31.0,OUT002,1998.0,Small,Tier 3,Food Mart,253.53,2020-02-03 06:23:52
1,FD5378,7.64,Low Sugar,0.019,Meat,41.84,OUT002,1998.0,Small,Tier 3,Food Mart,166.92,2020-02-03 06:23:52
2,FD735,6.95,Low Sugar,0.079,Canned,50.13,OUT002,1998.0,Small,Tier 3,Food Mart,180.86,2020-02-03 06:23:52
3,FD4245,7.02,Regular,0.029,Baking Goods,50.42,OUT002,1998.0,Small,Tier 3,Food Mart,203.55,2020-02-03 06:23:52
4,FD6089,9.69,Regular,0.027,Snack Foods,51.9,OUT002,1998.0,Small,Tier 3,Food Mart,836.23,2020-02-03 06:23:52


**The column 'Date' is now in datetime format. We can now display the dates in any other format.**

In [325]:
data['Date'].dt.strftime('%m/%d/%Y')

0       02/03/2020
1       02/03/2020
2       02/03/2020
3       02/03/2020
4       02/03/2020
           ...    
8758    02/03/2020
8759    02/03/2020
8760    02/03/2020
8761    02/03/2020
8762    02/03/2020
Name: Date, Length: 8763, dtype: object

In [326]:
data['Date'].dt.strftime('%m-%d-%y')

0       02-03-20
1       02-03-20
2       02-03-20
3       02-03-20
4       02-03-20
          ...   
8758    02-03-20
8759    02-03-20
8760    02-03-20
8761    02-03-20
8762    02-03-20
Name: Date, Length: 8763, dtype: object
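
Note the dtype of the two outputs above: strftime() returns text, not datetimes. A minimal sketch on a hypothetical Series (`dates`) showing that the formatted result holds plain strings while the original column is left unchanged:

```python
import pandas as pd

# hypothetical datetimes
dates = pd.Series(pd.to_datetime(['2020-02-03 06:23:52', '2021-11-15 18:00:00']))

# strftime() builds a new Series of strings; dates itself is not modified
date_only = dates.dt.strftime('%Y-%m-%d')
time_only = dates.dt.strftime('%H:%M')

print(date_only.tolist())
print(time_only.tolist())
```

For this reason it is usually best to keep the datetime column as it is and format it only for display or export.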

**Extracting year from the date column**

In [327]:
data['Date'].dt.year

0       2020
1       2020
2       2020
3       2020
4       2020
        ... 
8758    2020
8759    2020
8760    2020
8761    2020
8762    2020
Name: Date, Length: 8763, dtype: int32

Creating a new column and adding the extracted year values into the dataframe.

In [328]:
data['year'] = data['Date'].dt.year

**Extracting month from the date column**

In [329]:
data['Date'].dt.month

0       2
1       2
2       2
3       2
4       2
       ..
8758    2
8759    2
8760    2
8761    2
8762    2
Name: Date, Length: 8763, dtype: int32

Creating a new column and adding the extracted month values into the dataframe.

In [330]:
data['month'] = data['Date'].dt.month

**Extracting day from the date column**

In [331]:
data['Date'].dt.day

0       3
1       3
2       3
3       3
4       3
       ..
8758    3
8759    3
8760    3
8761    3
8762    3
Name: Date, Length: 8763, dtype: int32

Creating a new column and adding the extracted day values into the dataframe.

In [332]:
data['day'] = data['Date'].dt.day

In [333]:
data.head()

Unnamed: 0,Product_Id,Product_Weight,Product_Sugar_Content,Product_Allocated_Area,Product_Type,Product_MRP,Store_Id,Store_Establishment_Year,Store_Size,Store_Location_City_Type,Store_Type,Product_Store_Sales_Total,Date,year,month,day
0,NC7411,9.0,No Sugar,0.03,Health and Hygiene,31.0,OUT002,1998.0,Small,Tier 3,Food Mart,253.53,2020-02-03 06:23:52,2020,2,3
1,FD5378,7.64,Low Sugar,0.019,Meat,41.84,OUT002,1998.0,Small,Tier 3,Food Mart,166.92,2020-02-03 06:23:52,2020,2,3
2,FD735,6.95,Low Sugar,0.079,Canned,50.13,OUT002,1998.0,Small,Tier 3,Food Mart,180.86,2020-02-03 06:23:52,2020,2,3
3,FD4245,7.02,Regular,0.029,Baking Goods,50.42,OUT002,1998.0,Small,Tier 3,Food Mart,203.55,2020-02-03 06:23:52,2020,2,3
4,FD6089,9.69,Regular,0.027,Snack Foods,51.9,OUT002,1998.0,Small,Tier 3,Food Mart,836.23,2020-02-03 06:23:52,2020,2,3


* We can see that year, month, and day columns have been added in the dataset.
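
The .dt accessor exposes further components in the same way as year, month and day. A minimal sketch with a few of them on a hypothetical Series (`dates`); the column names in `parts` are illustrative:

```python
import pandas as pd

# hypothetical datetimes
dates = pd.Series(pd.to_datetime(['2020-02-03 06:23:52', '2020-11-15 18:00:00']))

parts = pd.DataFrame({
    'quarter': dates.dt.quarter,    # 1 to 4
    'weekday': dates.dt.dayofweek,  # Monday=0 ... Sunday=6
    'hour': dates.dt.hour,          # 0 to 23
})
print(parts)
```

Each of these can be assigned to a new column exactly like the year, month and day columns above.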

In [334]:
# the datetime format makes date arithmetic straightforward
timedelta = data['Date'][1] - data['Date'][0]  # difference between two datetimes
print(data['Date'][1])
print(data['Date'][0])
print('timedelta :', timedelta)  # the difference is 0 because both datetimes are the same

2020-02-03 06:23:52
2020-02-03 06:23:52
timedelta : 0 days 00:00:00
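
Since every row in this dataset carries the same timestamp, the difference above is zero. A minimal sketch with two hypothetical timestamps (`start` and `end`) that shows what a non-zero difference looks like and how to read it:

```python
import pandas as pd

# hypothetical timestamps, two and a half days apart
start = pd.Timestamp('2020-02-03 06:23:52')
end = pd.Timestamp('2020-02-05 18:23:52')

delta = end - start                    # a pandas Timedelta
whole_days = delta.days                # complete days only
hours = delta.total_seconds() / 3600   # full duration expressed in hours
print(delta, whole_days, hours)
```

The days attribute drops the partial day, so total_seconds() is the safer choice when the full duration matters.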
