# Introduction
This notebook is intended as a learning tool for fellow data scientists and data science enthusiasts. It covers the pandas library. The material is drawn from several sources; I will do my best to cite them all and give credit where it is due.

## What is pandas?
pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language. (https://pandas.pydata.org/)


## Prerequisites
Although I will start from the very basics, the reader is expected to have a fairly decent understanding of Python. After all, this is a library built on Python.


I hope this somehow helps someone out there! Good luck!

~ Ahmed


# Data structures


Whenever I pick up something new, I like to learn its basic data structures first. This lets us view the functionality and the operations from the perspective of the developers who wrote the library. Without a strong understanding of its data structures, using pandas would require frequent "googling" (not that understanding them makes you completely independent of Google!).

I guess the point here is that once we see how the data is handled from the perspective of the library, things make far more sense and working with it becomes intuitive. Enough mumbling! Let's get on with it!

Most of the material in this section comes from the pandas docs (https://pandas.pydata.org/docs/user_guide/dsintro.html#dsintro).

There are two main data structures in pandas. They are:
- **Series**
- **DataFrame**

## Series
A Series is essentially a **one-dimensional labelled array** capable of holding any data type (integers, strings, floats, Python objects, and so on). Here, the term labelled means that every value is associated with an index label. Let's explore with examples.


In [4]:
# import the libraries
import numpy as np

import pandas as pd

In Python terms, a **list or a dict** is a good metaphor for a Series. It acts like a list if no index is given.

In [33]:
# this will create a pandas Series of 5 random integers. No index is given, so pandas assigns the default integer labels 0 to 4
s_unindexed = pd.Series(np.random.randint(low = 0, high = 10, size = 5))

# show the entire series
print(f'The series is \n{s_unindexed}\n')

# show the first value
print(f'The first value of the series is {s_unindexed[0]}')


The series is 
0    2
1    8
2    0
3    7
4    2
dtype: int32

The first value of the series is 2
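
The list-like behaviour goes a little further than indexing. As a minimal sketch, using fixed example values instead of random ones (the name `s` and the values are only illustrative):

```python
import pandas as pd

# Hypothetical example values, fixed so the results are reproducible
s = pd.Series([2, 8, 0, 7, 2])

# Like a list, a Series has a length...
n_items = len(s)

# ...can be sliced by position (here: positions 1 and 2)...
middle = s.iloc[1:3]

# ...can be iterated over...
values = [v for v in s]

# ...and can be converted back to a plain Python list
as_list = s.tolist()

print(n_items)
print(middle)
print(as_list)
```

Unlike a list, the slice `middle` keeps the original labels (1 and 2) rather than starting again from 0.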


It acts like a dict if an index *is* given:

In [34]:
# this will create a pandas Series of 5 random integers. The axis labels (or index) are given as a list of labels
s_indexed = pd.Series(np.random.randint(low = 0, high = 10, size = 5), index=["a", "b", "c", "d", "e"])
# show the entire series
print(f'The series is \n{s_indexed}\n')

# show the first value
print(f'The first value of the series accessed by its index label: {s_indexed["a"]}')
# .iloc selects by position; plain s_indexed[0] on a labelled series is deprecated in recent pandas versions
print(f'The first value can also be accessed by position: {s_indexed.iloc[0]}')



The series is 
a    6
b    8
c    0
d    0
e    7
dtype: int32

The first value of the series accessed by its index label: 6
The first value can also be accessed by position: 6
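
To make the dict comparison a bit more concrete, here is a small sketch with fixed example values (the series `s` and its values are hypothetical). It uses `.loc` for label-based access, `.iloc` for position-based access, and the dict-style `in` and `.get`:

```python
import pandas as pd

# Hypothetical example values with string labels
s = pd.Series([6, 8, 0, 0, 7], index=["a", "b", "c", "d", "e"])

# Explicit label-based and position-based access
by_label = s.loc["a"]
by_position = s.iloc[0]

# Membership tests look at the labels, just like dict keys
has_b = "b" in s
has_z = "z" in s

# .get returns a default instead of raising for a missing label, like dict.get
missing = s.get("z", -1)

print(by_label, by_position, has_b, has_z, missing)
```

Spelling out `.loc` and `.iloc` avoids any ambiguity about whether a value such as `0` is meant as a label or as a position.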


To get all index labels, we do the following:

In [35]:
print(f'Index of unindexed series {s_unindexed.index}')
print(f'Index of indexed series {s_indexed.index}')

Index of unindexed series RangeIndex(start=0, stop=5, step=1)
Index of indexed series Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
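
The index can also be converted to a plain list, and it can be replaced after the series has been created. A short sketch, assuming a small hypothetical series `s`:

```python
import pandas as pd

# Hypothetical example values
s = pd.Series([6, 8, 0], index=["a", "b", "c"])

# The labels as a plain Python list, and the values as a NumPy array
labels = s.index.tolist()
values = s.to_numpy()

# Assign new labels; the new index must have the same length as the series
s.index = ["x", "y", "z"]

print(labels)
print(values)
print(s)
```

After the reassignment the values are unchanged; only the labels used to look them up differ.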


### Prototype and data
Now that we have a basic understanding of what a series is, we can look at how it can be defined and what can be done with it.

A simplified prototype of the Series constructor is as follows:

### <p style="color:#0331A1; text-align: center;"> sample_series = pd.Series(data, index)</p>


Parameters:
- data: The contents of the series. Possible values:
    - a scalar value (e.g. 5)
    - a list or NumPy array
    - a Python dict
- index: A list containing the index labels of the series
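
As a rough illustration of the three kinds of `data` listed above, the sketch below builds one series from each. The variable names and values are made up for the example:

```python
import pandas as pd

labels = ["a", "b", "c"]

# A scalar is repeated once for every label in the index
from_scalar = pd.Series(5, index=labels)

# A list is paired with the index position by position
from_list = pd.Series([1, 2, 3], index=labels)

# A dict supplies its own labels through its keys
from_dict = pd.Series({"a": 1, "b": 2, "c": 3})

print(from_scalar)
print(from_list)
print(from_dict)
```

Here `from_list` and `from_dict` end up holding the same labels and values, which shows that the dict keys play the role of the `index` argument.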


When creating a series from a dict, we don't need to provide the index: the dict keys become the index labels.

In [46]:
person = {'name': 'Samantha', 'age': 33, 'height': '5ft 11in'}
person_series = pd.Series(person)
print(f'The person\'s data is\n{person_series}')
print(f'The person\'s age is {person_series["age"]}')

The person's data is
name      Samantha
age             33
height    5ft 11in
dtype: object
The person's age is 33


However, if you pass an index anyway, any label in the index that is not a key of the dict is added with a NaN value.

Conversely, any key in the dict that is not listed in the index is left out of the series. Hence, make sure the two match if you do this.

In [48]:
person_series = pd.Series(person, index=['name', 'age', 'weight'])
print(person_series)
# notice how the height attribute in the original dictionary has been removed.

name      Samantha
age             33
weight         NaN
dtype: object
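
One way to check for such mismatches is sketched below. It reuses the `person` dict from above; the helper names `missing_labels` and `dropped_keys` are introduced only for this illustration:

```python
import pandas as pd

person = {"name": "Samantha", "age": 33, "height": "5ft 11in"}
labels = ["name", "age", "weight"]

s = pd.Series(person, index=labels)

# Labels requested in the index that had no matching dict key (filled with NaN)
missing_labels = s[s.isna()].index.tolist()

# Dict keys that were left out because they are not in the index
dropped_keys = [key for key in person if key not in s.index]

print(missing_labels)
print(dropped_keys)
```

Note that `isna()` cannot tell a label that was missing from the dict apart from a key whose value was already NaN or `None`, so treat this as a quick sanity check only.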
