# Pandas Series

---

###  Activities

Before we go ahead with more concepts on Machine Learning and Artificial Intelligence, let's first learn a bit of Data Analysis, so that you can get a better understanding of data. Every dataset tells you a story if you look at it through the right lenses.

To give you a perspective, imagine that you have data on the monthly sales of every single shop in your city. You would notice that during festivals, the sales volume of sweets rises sharply. Similarly, the sales volumes of clothes, jewellery and electronic products also rise significantly in this period.

If you have the tourism data, then you would see that a lot of people in India go on a vacation in the months of May and June, which makes sense because schools are closed in these two months due to summer vacation.

Through data, you can observe a trend and, based on that trend, draw meaningful insights that help you make decisions in your daily life, in business organisations, and in medical and engineering applications.

When it comes to Data Analysis in Python, we use a module called Pandas, which is specifically designed to manipulate, manage and analyse large amounts of data using Pandas Series and Pandas DataFrames.

In this lesson, we will learn about the Pandas Series.

---

#### Pandas Series 

A Pandas series is a one-dimensional array which can hold various data types. It is similar to a Python list and a NumPy array.

Without going too much into the theory, let's get started with the Pandas series right away. At the end of the class, we will learn when to use a Python list, a NumPy array and a Pandas series.

---

#### Activity 1: Python List To Pandas Series Conversion

Let's understand the Pandas series through an example. Suppose there are `30` students in your class and their weights vary in the range of `45` to `60` kg (both inclusive).

We can create a Pandas series containing the weights of the students by first creating a Python list and then converting it to a Pandas series. To create a Pandas series, you have to first import the `pandas` module using the `import` keyword. 

```
import pandas as pd
```

Here, `pd` is an alias (or nickname) for `pandas`.

Then you can call the `Series()` function to convert a Python list or a NumPy array into a Pandas series.

```
import random  # needed for random.randint()
weights = pd.Series([random.randint(45, 60) for i in range(30)])
```

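**Note:** Unlike most of the functions you have seen so far, `Series()` begins with the uppercase letter `S`. This is because `Series` is technically a class, and calling `pd.Series()` creates a new series object.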

In [None]:
#Create a Pandas series containing 30 random integers between 45 and 60.
import random
import pandas as pd

weights = pd.Series([random.randint(45,60) for i in range(30)])
print(weights)

0     53
1     47
2     47
3     57
4     51
5     55
6     56
7     47
8     46
9     54
10    49
11    54
12    54
13    58
14    48
15    48
16    59
17    56
18    59
19    55
20    52
21    55
22    52
23    57
24    48
25    60
26    55
27    54
28    58
29    56
dtype: int64


The first column in the output represents the indices of all the items in the `weights` Pandas series. The second column contains the weights of the students. The data-type of each item is an `int`.

**Note:** Ignore the `64` in the `int64` for the time being. 
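Because the weights are generated randomly, the numbers you see will differ from run to run (and from the outputs shown in this lesson). If you want repeatable values, one option is to seed the random number generator first. The sketch below assumes an arbitrary seed of `42`; the variable `weights_again` is introduced only for illustration.

```python
import random

import pandas as pd

random.seed(42)  # 42 is an arbitrary seed chosen for this sketch
weights = pd.Series([random.randint(45, 60) for _ in range(30)])

random.seed(42)  # resetting the same seed reproduces the same values
weights_again = pd.Series([random.randint(45, 60) for _ in range(30)])

print(weights.equals(weights_again))  # True: both series hold identical values
```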

Using the `Series()` function, you can convert any one-dimensional Python list into a Pandas series. Now, let's verify whether `weights` is a Pandas series or not.

In [None]:
# Verify the type of value stored in the 'weights' variable using the 'type()' function.
type(weights)

pandas.core.series.Series

The `type()` function returns `pandas.core.series.Series` as an output which confirms that `weights` is indeed a Pandas series. 
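As mentioned earlier, `Series()` also accepts a NumPy array. Here is a minimal sketch of that conversion; the `heights_array` values are made up for illustration and are not part of the lesson's data.

```python
import numpy as np
import pandas as pd

# Hypothetical heights (in cm), stored in a NumPy array.
heights_array = np.array([150, 162, 158])

# Convert the NumPy array into a Pandas series.
heights = pd.Series(heights_array)

print(heights)
print(type(heights))
```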

A Pandas series can also contain the items of multiple data-types. Recall that in the trial class we created 4 different variables to store the attributes of a planet.  

||Mercury|
|-|-|
|Diameter (km)|4879|
|Gravity ($m/s^2$)|3.7|
|Ring|No|


Let's store the name of a planet, its diameter, gravity and whether it has a ring or not in a Python list and then convert it into a pandas series.


In [None]:
# Create a Python list which contains the planet name, diameter, gravity and False if the planet does not have a ring.
# Convert the list into a Pandas series. Also, verify whether the list was successfully converted to a Pandas series or not.
planet_details = pd.Series(['Mercury', 4879, 3.7, False])
print(planet_details)
type(planet_details)

0    Mercury
1       4879
2        3.7
3      False
dtype: object


pandas.core.series.Series

Here the data-type is `object`. A Pandas series reports a single data-type for all of its items. Because the items here are of different types (a string, an integer, a float and a boolean), Pandas uses the general-purpose `object` data-type to represent all of them.
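Even though the series as a whole reports `object`, each item keeps its own Python type. The short sketch below rebuilds the same `planet_details` series and prints the type of every item.

```python
import pandas as pd

planet_details = pd.Series(['Mercury', 4879, 3.7, False])

print(planet_details.dtype)  # object

# Loop over the items and print each one with the name of its type.
for item in planet_details:
    print(item, type(item).__name__)
```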

You can also use the `size` attribute to find the number of items in a Pandas series. Note that `size` is written without parentheses because it is an attribute, not a function.

In [None]:
# Find the number of items in the 'weights' Pandas series using the 'size' attribute.
weights.size

30

So, there are `30` items in the `weights` Pandas series. 

You can also use the `shape` attribute to find the dimensions of a Pandas series.

In [None]:
# Find the dimensions of the 'weights' Pandas series using the 'shape' attribute.
weights.shape

(30,)

So, the `weights` Pandas series has `30` rows. Because a series is one-dimensional, `shape` contains only the number of rows; you can think of the series as a single column of `30` values.
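To see how `size` and `shape` relate, here is a small sketch on a hypothetical four-item series named `sample_weights` (not the lesson's random data). It also uses `len()` and the `ndim` attribute, which gives the number of dimensions.

```python
import pandas as pd

# Hypothetical weights, used only for this illustration.
sample_weights = pd.Series([50, 45, 60, 55])

print(sample_weights.size)   # 4
print(len(sample_weights))   # 4
print(sample_weights.shape)  # (4,)
print(sample_weights.ndim)   # 1, because a series is one-dimensional
```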

---

#### Activity 2: The `mean(), min(), max()` Functions 

The `mean()` function does not require any input and returns the average value of all the items as an output.

To apply this function, write the Pandas series whose mean value you need to compute, followed by the dot (`.`) operator and `mean()`.


In [None]:
# Calculate the average value of all the numbers in a Pandas series.
weights.mean()

52.56666666666667

Similarly, you can also find the minimum and maximum values in a Pandas series using the `min()` and `max()` functions.

In [None]:
# Using the 'min()' and 'max()' functions, print the minimum and maximum values in the 'weights' Pandas series.
print(weights.min())
weights.max()

45


59
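As a quick check of what `mean()` computes, the sketch below divides the sum of the items by their count and compares the result with `mean()`. It uses a hypothetical four-item series so that the numbers are easy to verify by hand.

```python
import pandas as pd

# Hypothetical weights, used only for this illustration.
sample_weights = pd.Series([50, 45, 60, 55])

# The mean is the sum of the items divided by the number of items.
manual_mean = sample_weights.sum() / sample_weights.count()

print(manual_mean)            # 52.5
print(sample_weights.mean())  # 52.5
print(sample_weights.min(), sample_weights.max())  # 45 60
```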

---

#### Activity 3: The `head()` And `tail()` Functions

Sometimes instead of looking at the full dataset, we just want to look at the first few rows or the last few rows of the dataset. In such cases, we can use the `head()` and `tail()` functions.

By default, the `head()` function shows the first five and the `tail()` function shows the last five items in a Pandas series. You can pass a number to see more or fewer items.

In [None]:
# Print the first 10 items in a Pandas series using the 'head()' function.
weights.head(10)

0    58
1    54
2    57
3    55
4    55
5    46
6    57
7    46
8    51
9    58
dtype: int64

The numbers in the first column in the output are the indices of each item in the Pandas series. Since we printed the first ten items, the indices range from `0` to `9`.

In [None]:
# Using the 'tail()' function, print the last 5 items in the Pandas series.
weights.tail()

25    50
26    46
27    59
28    59
29    57
dtype: int64

Since we printed the last five items of the series, the indices range from `25` to `29`.

Within the `head()` and `tail()` functions, you can pass a number `n` to specify how many of the first or last items you wish to see in a Pandas series, for example `weights.head(10)`.
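Here is a minimal sketch of passing `n` to both functions, using a hypothetical seven-item series so that the result is easy to follow.

```python
import pandas as pd

# Hypothetical weights, used only for this illustration.
sample_weights = pd.Series([50, 45, 60, 55, 48, 52, 47])

first_three = sample_weights.head(3)  # items at indices 0, 1 and 2
last_two = sample_weights.tail(2)     # items at indices 5 and 6

print(first_three.tolist())     # [50, 45, 60]
print(last_two.tolist())        # [52, 47]
print(last_two.index.tolist())  # [5, 6]
```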

---

#### Activity 4: Indexing A Pandas Series

Slicing a Pandas series with the default indices works the same way as slicing a Python list or a NumPy array.

Let's say we want to get the weights of the students whose indices range from `13` to `21`. To do this, write the variable storing the Pandas series followed by square brackets `[]`. Inside the square brackets, mention the range of items you wish to retrieve from the series. As with a Python list, the item at `end_index` is not included, so we write `13:22` to get the items up to index `21`.

**Syntax:** `pandas_series[start_index:end_index]`

In [None]:
# Retrieve items from a Pandas series using the indexing method.
weights[13:22]

13    47
14    53
15    53
16    52
17    59
18    46
19    58
20    45
21    57
dtype: int64
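The sketch below repeats the slicing idea on a hypothetical seven-item series. It also uses `iloc`, which Pandas provides for selecting items purely by position; for a series with the default indices, both approaches give the same result.

```python
import pandas as pd

# Hypothetical weights, used only for this illustration.
sample_weights = pd.Series([50, 45, 60, 55, 48, 52, 47])

# Items at indices 2, 3 and 4; the end index 5 is excluded.
subset = sample_weights[2:5]

print(subset.tolist())        # [60, 55, 48]
print(subset.index.tolist())  # [2, 3, 4]

# 'iloc' selects by position and returns the same items here.
print(sample_weights.iloc[2:5].equals(subset))  # True
```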

---

#### Activity 5: The `mode()` Function

Let's say you want to find out which weight is shared by the largest number of students in your class, i.e., the most frequently occurring value. For this, you can use the `mode()` function.

In [None]:
# Compute the modal value in the 'weights' series.
weights.mode()

0    46
dtype: int64

**Note**: A dataset can have more than one modal value. 
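This is also why `mode()` returns a Pandas series rather than a single number. The sketch below uses a hypothetical series in which two values are tied for the highest count.

```python
import pandas as pd

# Hypothetical weights: 46 and 57 both occur twice.
sample_weights = pd.Series([46, 57, 46, 50, 57])

modes = sample_weights.mode()

print(modes.tolist())  # [46, 57]
print(modes[0])        # 46, the first modal value
```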


---

#### Activity 6: The `sort_values()` Function

We can use the `sort_values()` function to arrange the numbers in a Pandas series in either ascending or descending order.

To arrange the numbers in a Pandas series in increasing order, call the `sort_values()` function with `ascending=True` as an input. This is also the default behaviour, so `sort_values()` without any input gives the same result.

In [None]:
# Arrange the weights in the increasing order using the 'sort_values()' function.
weights.sort_values(ascending=True)

8     46
1     47
2     47
7     47
14    48
24    48
15    48
10    49
4     51
22    52
20    52
0     53
12    54
27    54
9     54
11    54
26    55
5     55
21    55
19    55
6     56
17    56
29    56
23    57
3     57
28    58
13    58
16    59
18    59
25    60
dtype: int64

To arrange the numbers in a Pandas series in decreasing order, call the `sort_values()` function with `ascending=False` as an input.

In [None]:
# Using the 'sort_values()' function, arrange the weights in the decreasing order.
weights.sort_values(ascending=False)

28    59
27    59
17    59
0     58
9     58
19    58
21    57
29    57
2     57
6     57
12    56
22    56
4     55
3     55
1     54
15    53
14    53
16    52
23    51
8     51
25    50
13    47
7     46
18    46
24    46
26    46
5     46
20    45
11    45
10    45
dtype: int64
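Notice in the outputs above that every item keeps its original index after sorting. The sketch below, on a hypothetical three-item series, shows this and also that `sort_values()` returns a new series and leaves the original unchanged. The `reset_index(drop=True)` call is one way to renumber the sorted items from `0`.

```python
import pandas as pd

# Hypothetical weights, used only for this illustration.
sample_weights = pd.Series([53, 47, 57])

sorted_weights = sample_weights.sort_values()

print(sample_weights.tolist())        # [53, 47, 57], the original is unchanged
print(sorted_weights.tolist())        # [47, 53, 57]
print(sorted_weights.index.tolist())  # [1, 0, 2], the original indices are kept

# Renumber the sorted items from 0 and discard the old indices.
renumbered = sorted_weights.reset_index(drop=True)
print(renumbered.index.tolist())      # [0, 1, 2]
```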

---

#### Activity 7: The `median()` Function

To find the median value in a Pandas series, we can simply use the `median()` function.


In [None]:
# Using the 'median()' function, find the median weight in the weights series.
weights.median()

53.5

---

#### Activity 8: The `value_counts()` Function

To count the number of occurrences of each distinct item in a Pandas series, you can use the `value_counts()` function. The result lists the items from the most frequent to the least frequent.

In [None]:
# Count the number of times each item in the 'weights' Pandas series occurs.
weights.value_counts()

46    5
57    4
58    3
45    3
59    3
55    2
51    2
56    2
53    2
54    1
47    1
52    1
50    1
dtype: int64

**Note:** The `value_counts()` function is not available for Python lists and NumPy arrays.
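If you need the same counts for a Python list or a NumPy array, other tools exist. The sketch below shows two common options, `collections.Counter` from the standard library and `np.unique()` with `return_counts=True`, next to `value_counts()`. The `data` list is hypothetical.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Hypothetical weights, used only for this illustration.
data = [46, 57, 46, 58, 46, 57]

# Pandas series: value_counts() orders items by their count.
series_counts = pd.Series(data).value_counts()
print(series_counts.to_dict())  # {46: 3, 57: 2, 58: 1}

# Python list: Counter maps each item to its count.
list_counts = Counter(data)
print(list_counts[46])  # 3

# NumPy array: unique values and their counts as two arrays.
values, counts = np.unique(np.array(data), return_counts=True)
print(values.tolist(), counts.tolist())  # [46, 57, 58] [3, 2, 1]
```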

There is more to the Pandas series. We will learn it in detail along with Pandas DataFrames from the next class onwards. This is just an introductory class to Pandas.

---

#### Python List vs NumPy Array vs Pandas Series

You might now wonder when to use a Python list, a NumPy array and a Pandas series?

There are no hard rules to decide when to use which one of these data-structures but as a guide you may consider the following:

1. When you just want to store data, retrieve data and add more data, use a Python list.

2. When you want to store numerical data (one-dimensional or multidimensional) and want to perform a lot of mathematical operations, use a NumPy array, as it is faster than a Python list and makes it easy to create multidimensional arrays.

3. When you want to import data from an external file such as `TXT, XLSX, CSV, XML` etc., use Pandas. In the next class, you will learn how to import data from an external file. Additionally, Pandas allows you to interpret data in different ways. It also allows you to do complicated data extraction, manipulation and data processing operations on a dataset. Throughout this course, we will use the Pandas library to handle data.
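To make the guide above concrete, here is a small sketch that stores the same hypothetical weights in all three data-structures. The variable names and the kg-to-gram conversion are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Python list: easy to store data and add more items.
weights_list = [50, 45, 60]
weights_list.append(55)

# Arithmetic on a list needs a loop or a list comprehension.
grams_list = [w * 1000 for w in weights_list]

# NumPy array: arithmetic applies to every item at once.
weights_array = np.array(weights_list)
grams_array = weights_array * 1000

# Pandas series: built-in analysis functions such as median().
weights_series = pd.Series(weights_list)
print(grams_list)               # [50000, 45000, 60000, 55000]
print(grams_array.tolist())     # [50000, 45000, 60000, 55000]
print(weights_series.median())  # 52.5
```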

---

#### Activity 1: Calculate Mean

Write a function to calculate the mean of all the numbers contained in a list/array/series.

In [None]:
# Solution
def compute_mean(num_series):
  result = 0
  for num in num_series:
    result = result + num
  return result / len(num_series)

num_series = pd.Series([i for i in range(1, 11)])
print(num_series)
print("The required mean is", compute_mean(num_series))

0     1
1     2
2     3
3     4
4     5
5     6
6     7
7     8
8     9
9    10
dtype: int64
The required mean is 5.5


---

#### Activity 2: Calculate Median

Write a function to calculate the median of all the numbers contained in a list/array/series.

In [None]:
# Solution
def compute_median(num_series):
  result = 0
  num_list = [num for num in num_series]
  num_list.sort()
  if len(num_list) % 2 == 0:
    result = (num_list[(len(num_list) // 2) - 1] + num_list[len(num_list) // 2]) / 2
  else:
    result = float(num_list[len(num_list) // 2])
  return result

num_series = pd.Series([random.randint(30, 60) for i in range(10)])
print("Median without using the standard median() function:", compute_median(num_series))
print("Median using the standard median() function:", num_series.median())

Median without using the standard median() function: 46.5
Median using the standard median() function: 46.5


---

#### Activity 3: The `head()` & `tail()` Functions With Negative Inputs

Let `pd_series` be a Pandas series which contains `N` elements. Let `n` be some positive integer such that `n < N`.

The `pd_series.head(-n)` operation will return the **first** `N - n` items contained in the `pd_series`, i.e., every item except the last `n`. In the example below, `N = 30` and `n = 8`, so the first `22` items are returned.



In [None]:
# 'head()' with negative input.
weights.head(-8)

0     53
1     47
2     47
3     57
4     51
5     55
6     56
7     47
8     46
9     54
10    49
11    54
12    54
13    58
14    48
15    48
16    59
17    56
18    59
19    55
20    52
21    55
dtype: int64

The `pd_series.tail(-n)` operation will return the **last** `N - n` items contained in the `pd_series`, i.e., every item except the first `n`.

In [None]:
# 'tail()' with negative input.
weights.tail(-8)

8     60
9     48
10    60
11    59
12    56
13    51
14    58
15    55
16    48
17    55
18    48
19    57
20    56
21    56
22    56
23    50
24    58
25    55
26    50
27    47
28    50
29    52
dtype: int64
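Another way to think about negative inputs is in terms of slicing. The sketch below uses a hypothetical series of the numbers `0` to `9` (so `N = 10`) with `n = 3`, and checks that `head(-n)` and `tail(-n)` match the position-based slices `iloc[:-n]` and `iloc[n:]`.

```python
import pandas as pd

# Hypothetical series with N = 10 items.
numbers = pd.Series(range(10))
n = 3

all_but_last_n = numbers.head(-n)   # drops the last 3 items
all_but_first_n = numbers.tail(-n)  # drops the first 3 items

print(all_but_last_n.tolist())   # [0, 1, 2, 3, 4, 5, 6]
print(all_but_first_n.tolist())  # [3, 4, 5, 6, 7, 8, 9]

# The same results using position-based slicing.
print(all_but_last_n.equals(numbers.iloc[:-n]))  # True
print(all_but_first_n.equals(numbers.iloc[n:]))  # True
```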

---