# Introduction to Pandas

- **What is Pandas?**
    - Pandas is an open-source Python library built on top of NumPy.
    - Pandas is used for data manipulation and analysis.

## Installation

In [36]:
# Installing Pandas
! pip install pandas



In [37]:
# Importing the pandas library
import pandas as pd

**Pandas Components**

- Series
    - One-Dimensional
    - It is like a column inside a table

- DataFrame
    - Two-dimensional
    - It is a table consisting of multiple rows and columns
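
To make the difference between the two components concrete, here is a small sketch. The variable names and sample values (`ages`, `table`, the city names) are invented for illustration only.

```python
import pandas as pd

# A Series: a single labelled column of values
ages = pd.Series([19, 24, 30], name='age')

# A DataFrame: a table made of several columns
table = pd.DataFrame({'age': [19, 24, 30],
                      'city': ['Pune', 'Delhi', 'Goa']})

print(ages.ndim)    # number of dimensions of the Series
print(table.ndim)   # number of dimensions of the DataFrame

# Selecting one column of a DataFrame gives back a Series
print(type(table['age']))
```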

## Pandas Series

- How to create a Pandas Series?
    - pd.Series(data, index, dtype, copy)
        - data : The values that go inside the series (for example a list, NumPy array, or dictionary)
        - index : The labels for the values. Starts at 0 by default. It can be changed to anything, but the number of index labels must equal the number of data points
        - dtype : The data type of all the values. It can be string, integer, float, boolean, etc.
        - copy : Whether to copy the input data (False by default)
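
As a rough illustration of the parameters listed above, the sketch below passes each one explicitly. The names `raw`, `s`, `s_copy`, and `mismatch_raised` are made up for this example.

```python
import numpy as np
import pandas as pd

raw = np.array([19, 24, 30])

# data + index + dtype: three values, three labels, stored as floats
s = pd.Series(data=raw, index=['a', 'b', 'c'], dtype='float64')
print(s)

# copy=True asks pandas to copy the input, so a later change to `raw`
# does not show up in the series
s_copy = pd.Series(raw, copy=True)
raw[0] = 99
print(s_copy.iloc[0])

# Passing fewer labels than data points raises a ValueError
mismatch_raised = False
try:
    pd.Series([1, 2, 3], index=['a', 'b'])
except ValueError as err:
    mismatch_raised = True
    print('Length mismatch:', err)
```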

**Ways to create a Pandas Series**
- Using List
- Using NumPy Array
- Using Dictionary

In [38]:
# Creating series using list
age_series = pd.Series([19, 24, 30, 41, 53, 64])

# Print the series
age_series

0    19
1    24
2    30
3    41
4    53
5    64
dtype: int64

In [39]:
# Importing Numpy
import numpy as np

# Creating series using numpy array
age_series_np = pd.Series(np.array([30, 52, 43, 40, 50, 60]))

# Print the series
age_series_np

0    30
1    52
2    43
3    40
4    50
5    60
dtype: int32

- **Let's change the index numbers**

In [40]:
salary_series_index = pd.Series(np.array([20000, 12000, 43000, 45000, 65000, 66000]),
                               index = np.arange(0,12,2))
salary_series_index

0     20000
2     12000
4     43000
6     45000
8     65000
10    66000
dtype: int32

- **Let's change the index to strings**

In [41]:
emp_series = pd.Series(np.array([20000, 12000, 43000, 45000, 65000, 66000]),
                               index = ['A1','A2','A3','A4','A5','A6'])
emp_series

A1    20000
A2    12000
A3    43000
A4    45000
A5    65000
A6    66000
dtype: int32

In [42]:
# Creating Series using Dictionary
prod_dict = {'Dairy':23000, 'Soft Drinks': 45000, 'Fruits & Vegetables': 67000}
prod_series = pd.Series(prod_dict)
prod_series

Dairy                  23000
Soft Drinks            45000
Fruits & Vegetables    67000
dtype: int64

In [43]:
# Accessing only index names
prod_series.index

Index(['Dairy', 'Soft Drinks', 'Fruits & Vegetables'], dtype='object')

In [44]:
# Accessing only values
prod_series.values

array([23000, 45000, 67000], dtype=int64)

## Accessing Series Elements

- What if I want to get only specific values inside a series?
    - The [] operator is used to access a specific value inside the series by its index.
        - For example: if you want to access the 5th element, its default index value is 4. So, series_name[4] will return the 5th element of the series.

In [45]:
# Creating a series
emp_id = pd.Series(np.array(['A101','A102','A103','B101','B102','B103','C101','C102','C103']))
emp_id

0    A101
1    A102
2    A103
3    B101
4    B102
5    B103
6    C101
7    C102
8    C103
dtype: object

In [46]:
# Accessing the 5th element
emp_id[4]

'B102'

In [47]:
# Accessing first 4 elements
emp_id[:4]

0    A101
1    A102
2    A103
3    B101
dtype: object

In [48]:
# Accessing last 4 elements
emp_id[-4:]

5    B103
6    C101
7    C102
8    C103
dtype: object

- Note: If you want to access multiple elements inside the series, pass a list of the index values you want to access

In [49]:
# Accessing multiple elements
emp_id[[3,4]]

3    B101
4    B102
dtype: object
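
The `[]` lookups above work because `emp_id` uses the default index, where each label equals its position. When the index is customised the two can differ, so it may help to state which one is meant: `.loc` looks up by label and `.iloc` by position. The series `emp` and its index values below are hypothetical.

```python
import pandas as pd

# Labels (10, 20, 30, 40) no longer match positions (0, 1, 2, 3)
emp = pd.Series(['A101', 'A102', 'A103', 'B101'],
                index=[10, 20, 30, 40])

by_label = emp.loc[20]        # the value whose label is 20
by_position = emp.iloc[1]     # the value in the second position
first_two = emp.iloc[:2]      # slice by position
picked = emp.loc[[10, 40]]    # several labels at once

print(by_label, by_position)
print(first_two)
print(picked)
```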

## Filtering the series

In [51]:
# Creating a series of salary
salary_array = np.array([20000, 12000, 43000, 45000, 65000])
emp_name_array = np.array(['Alice','Bob','Charlie','David','Emma'])

sal_series = pd.Series(data = salary_array, index = emp_name_array)

# Filtering the employees with salary more than 40000
sal_series[sal_series > 40000]

Charlie    43000
David      45000
Emma       65000
dtype: int32
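
The filter above works in two steps: the comparison produces a boolean series, and `[]` keeps the rows where it is True. As a sketch of how this extends to more than one condition, conditions can be combined with `&` (and) and `|` (or), each wrapped in parentheses. The names `mask`, `mid_band`, and `extremes` and the thresholds are chosen just for this example.

```python
import pandas as pd

sal = pd.Series([20000, 12000, 43000, 45000, 65000],
                index=['Alice', 'Bob', 'Charlie', 'David', 'Emma'])

# Step 1: the comparison yields one True/False per employee
mask = sal > 40000
print(mask)

# Step 2: combine conditions; the parentheses are required
mid_band = sal[(sal > 40000) & (sal < 60000)]
extremes = sal[(sal < 15000) | (sal > 60000)]

print(mid_band)
print(extremes)
```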

## Arithmetic Operations

### Multiplication

- Multiplication of single series

In [52]:
# Creating a series and multiplying it by 5
simple_series = pd.Series(np.array([12,48,64]))
simple_series*5

0     60
1    240
2    320
dtype: int32

- Multiplication of 2 series

In [21]:
# The multiply() method multiplies two series element by element
serie1 = pd.Series([124,23,15])
serie2 = pd.Series([3,12,16])
serie1.multiply(serie2)

0    372
1    276
2    240
dtype: int64

### Addition
- Using + operator

In [22]:
serie1 + serie2

0    127
1     35
2     31
dtype: int64
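
The operators and the named methods are interchangeable for simple cases. The sketch below, using the same values as `serie1` and `serie2` under the hypothetical names `s1` and `s2`, shows the pairs side by side along with subtraction and division.

```python
import pandas as pd

s1 = pd.Series([124, 23, 15])
s2 = pd.Series([3, 12, 16])

product = s1 * s2          # same result as s1.multiply(s2)
total = s1.add(s2)         # same result as s1 + s2
difference = s1.sub(s2)    # same result as s1 - s2
quotient = s1.div(s2)      # same result as s1 / s2, always floats

print(product)
print(total)
print(difference)
print(quotient)
```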

## Ranking & Sorting

### Ranking

**Note: By default, ranking is in ascending order, and tied values share the average of their ranks (the two 30s below both get 3.5)**

In [61]:
rank_series = pd.Series([19, 24, 30, 41, 53, 64, 30])
rank_series.rank()

0    1.0
1    2.0
2    3.5
3    5.0
4    6.0
5    7.0
6    3.5
dtype: float64

In [24]:
series_a = pd.Series([41, 24, 18, 53, 64, 30])
series_a.rank()

0    4.0
1    2.0
2    1.0
3    5.0
4    6.0
5    3.0
dtype: float64

In [26]:
# Ranking in descending order
series_a.rank(ascending=False)

0    3.0
1    5.0
2    6.0
3    2.0
4    1.0
5    4.0
dtype: float64
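
Tied values are where the ranking options differ most. As a small sketch, the `method` parameter of `rank()` controls how ties are handled; the series below repeats the values of `rank_series` under the hypothetical name `scores`.

```python
import pandas as pd

scores = pd.Series([19, 24, 30, 41, 53, 64, 30])

average = scores.rank()                # default: ties share the mean rank
lowest = scores.rank(method='min')     # ties all get the lowest rank in the group
dense = scores.rank(method='dense')    # like 'min', but no gaps after a tie
first = scores.rank(method='first')    # ties ranked in order of appearance

print(pd.DataFrame({'average': average, 'min': lowest,
                    'dense': dense, 'first': first}))
```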

### Sorting
**Note: By default, sorting is also in ascending order and null values are placed last**

In [27]:
series_b = pd.Series([41, 24, np.nan, 18, 53, 64, 30, np.nan])
series_b.sort_values()

3    18.0
1    24.0
6    30.0
0    41.0
4    53.0
5    64.0
2     NaN
7     NaN
dtype: float64

In [54]:
# Sorting in descending order and putting the null values at the top
print(series_b)
series_b.sort_values(ascending=False, na_position='first')

0    41.0
1    24.0
2     NaN
3    18.0
4    53.0
5    64.0
6    30.0
7     NaN
dtype: float64


2     NaN
7     NaN
5    64.0
4    53.0
0    41.0
6    30.0
1    24.0
3    18.0
dtype: float64
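
Note that in the outputs above each value keeps its original index label after sorting, and `series_b` itself is unchanged. The sketch below illustrates this on a shorter, made-up series `s`, and shows `sort_index()` putting the rows back in their original order.

```python
import numpy as np
import pandas as pd

s = pd.Series([41, 24, np.nan, 18])

# sort_values() returns a new series; labels travel with their values
sorted_s = s.sort_values()
print(sorted_s.index.tolist())

# The original series is left as it was
print(s.index.tolist())

# Sorting by index restores the original order
restored = sorted_s.sort_index()
print(restored)
```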

## Checking Null Values
- **isnull()** : Checks each value in the series. Returns True where the value is null and False where it is not
- **notnull()** : The inverse of isnull(). Returns True where the value is not null and False where it is null

In [55]:
# Using isnull()
print(series_b)
series_b.isnull()

0    41.0
1    24.0
2     NaN
3    18.0
4    53.0
5    64.0
6    30.0
7     NaN
dtype: float64


0    False
1    False
2     True
3    False
4    False
5    False
6    False
7     True
dtype: bool

In [31]:
# Using notnull()
series_b.notnull()

0     True
1     True
2    False
3     True
4     True
5     True
6     True
7    False
dtype: bool
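
Because the results of `isnull()` and `notnull()` are boolean series, they can be summed to count nulls or used as a filter. A minimal sketch, rebuilding the same values as `series_b` under the hypothetical name `s`:

```python
import numpy as np
import pandas as pd

s = pd.Series([41, 24, np.nan, 18, 53, 64, 30, np.nan])

# True counts as 1, so summing the mask counts the null values
n_missing = s.isnull().sum()
print(n_missing)

# Using notnull() as a filter keeps only the non-null rows;
# dropna() is a shortcut for the same result
present = s[s.notnull()]
print(present)
```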

## Concatenate a Series

In [32]:
# Creating 2 series
class1 = np.linspace(0,100,20)
class2 = np.linspace(1,101,20)
series1 = pd.Series(class1)
series2 = pd.Series(class2)

# Using pandas concat method
pd.concat([series1, series2])

0       0.000000
1       5.263158
2      10.526316
3      15.789474
4      21.052632
5      26.315789
6      31.578947
7      36.842105
8      42.105263
9      47.368421
10     52.631579
11     57.894737
12     63.157895
13     68.421053
14     73.684211
15     78.947368
16     84.210526
17     89.473684
18     94.736842
19    100.000000
0       1.000000
1       6.263158
2      11.526316
3      16.789474
4      22.052632
5      27.315789
6      32.578947
7      37.842105
8      43.105263
9      48.368421
10     53.631579
11     58.894737
12     64.157895
13     69.421053
14     74.684211
15     79.947368
16     85.210526
17     90.473684
18     95.736842
19    101.000000
dtype: float64

**Note that the index is repeated. Let's check how to solve that.**

In [57]:
pd.concat([series1, series2], ignore_index=True)

0       0.000000
1       5.263158
2      10.526316
3      15.789474
4      21.052632
5      26.315789
6      31.578947
7      36.842105
8      42.105263
9      47.368421
10     52.631579
11     57.894737
12     63.157895
13     68.421053
14     73.684211
15     78.947368
16     84.210526
17     89.473684
18     94.736842
19    100.000000
20      1.000000
21      6.263158
22     11.526316
23     16.789474
24     22.052632
25     27.315789
26     32.578947
27     37.842105
28     43.105263
29     48.368421
30     53.631579
31     58.894737
32     64.157895
33     69.421053
34     74.684211
35     79.947368
36     85.210526
37     90.473684
38     95.736842
39    101.000000
dtype: float64

In [34]:
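# Using series append method
# Note: Series.append was deprecated in pandas 1.4 and removed in pandas 2.0;
# on newer versions use pd.concat([series1, series2]) instead
series1.append(series2)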


0       0.000000
1       5.263158
2      10.526316
3      15.789474
4      21.052632
5      26.315789
6      31.578947
7      36.842105
8      42.105263
9      47.368421
10     52.631579
11     57.894737
12     63.157895
13     68.421053
14     73.684211
15     78.947368
16     84.210526
17     89.473684
18     94.736842
19    100.000000
0       1.000000
1       6.263158
2      11.526316
3      16.789474
4      22.052632
5      27.315789
6      32.578947
7      37.842105
8      43.105263
9      48.368421
10     53.631579
11     58.894737
12     64.157895
13     69.421053
14     74.684211
15     79.947368
16     85.210526
17     90.473684
18     95.736842
19    101.000000
dtype: float64
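
Repeated index labels matter because a label lookup then matches more than one row. The sketch below uses shorter, made-up series `s1` and `s2` to check for duplicates with `index.is_unique` and to compare the two `pd.concat` calls.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(np.linspace(0, 100, 5))
s2 = pd.Series(np.linspace(1, 101, 5))

stacked = pd.concat([s1, s2])                        # labels 0-4 appear twice
renumbered = pd.concat([s1, s2], ignore_index=True)  # labels 0-9

print(stacked.index.is_unique)
print(renumbered.index.is_unique)

# With duplicate labels, one label returns several rows
print(stacked.loc[0])
```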

**Q. You have a Pandas Series of integers. Write a function that returns a new Series containing only the even numbers from the original Series, but with their positions (index) unchanged.**

In [64]:
ser = pd.Series(np.arange(1,25,3))
ser

0     1
1     4
2     7
3    10
4    13
5    16
6    19
7    22
dtype: int32

In [66]:
ser1 = ser[ser%2==0]
ser1

1     4
3    10
5    16
7    22
dtype: int32
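
Since the question asks for a function, the same filter can be wrapped in one. This is one possible sketch; the function name `keep_even` is an arbitrary choice.

```python
import numpy as np
import pandas as pd

def keep_even(series):
    """Return only the even values, keeping their original index labels."""
    return series[series % 2 == 0]

ser = pd.Series(np.arange(1, 25, 3))
evens = keep_even(ser)
print(evens)
```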

**Q. Given two Pandas Series of different lengths, align them by their index and add them together. If an index exists in only one of the Series, the result should keep the value from the existing Series without throwing an error.**

In [68]:
a = pd.Series(np.arange(1,25,6))
b = pd.Series(np.linspace(1,12,3))
print(a, b, sep='\n')

0     1
1     7
2    13
3    19
dtype: int32
0     1.0
1     6.5
2    12.0
dtype: float64


In [73]:
a.add(b,fill_value=0)

0     2.0
1    13.5
2    25.0
3    19.0
dtype: float64
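
To see why `fill_value` is needed here, compare it with the plain `+` operator, which leaves a null wherever an index exists in only one series. The sketch rebuilds `a` and `b` with the same values; `plain` and `filled` are names chosen for this example.

```python
import numpy as np
import pandas as pd

a = pd.Series(np.arange(1, 25, 6))      # index 0 to 3
b = pd.Series(np.linspace(1, 12, 3))    # index 0 to 2

# Plain addition: index 3 has no partner in b, so the result is NaN
plain = a + b
print(plain)

# fill_value=0 treats the missing partner as 0, so a's value is kept
filled = a.add(b, fill_value=0)
print(filled)
```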