# Python pandas

## Introduction to Pandas
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. The name of the library is derived from 'Panel Data', an econometrics term for datasets that include multiple observations over multiple periods of time.\
\
Unlike NumPy or matplotlib, pandas is NOT one of the essential components of the Python scientific computing toolset. It is used largely by the Data Science community because it makes the work much faster and easier. pandas is built on top of NumPy and relies on its computational abilities. In that sense, pandas is a library that takes advantage of the NumPy array structure, which contributes to the fast execution of mathematical operations. Hence, pandas requires NumPy.

| Series | DataFrame |
| :-: | :-: |
|Single-column data | Multi-column data |
| A set of observations related to a single variable. | A collection of Series objects, which contains observations related to one or several variables. Hence, the information is organised into rows and columns. |
| Corresponds to the 1D array structure from NumPy | Corresponds to the 2D array structure from NumPy |

\
A tool that is applicable to a Series will most probably be relevant to a DataFrame as well, and vice versa. However, there can be exceptions, which we will discover as we practise and learn further.
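
As a rough illustration of the table above, the sketch below builds a DataFrame from two Series and applies the same attribute to both structures. The `units` data and the column names are made up for this example and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical example data (names and values are illustrative only)
prices = pd.Series([22250, 16600, 15600], name="price")
units = pd.Series([3, 5, 2], name="units")

# A DataFrame can be assembled from several Series, one per column
sales = pd.concat([prices, units], axis=1)

# The same attribute works on both structures
print(prices.shape)  # (3,)
print(sales.shape)   # (3, 2)

# Selecting a single column of a DataFrame gives back a Series
print(type(sales["price"]))
```

Here `shape` reports one dimension for the Series and two for the DataFrame, which mirrors the 1D versus 2D correspondence with NumPy.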

### NumPy or pandas, which comes first?
From a *programming perspective*, ```NumPy``` comes first and then ```pandas```, but in the practice of a **data analyst** it's usually the other way around, as ```pandas``` focuses more on your analytical task and less on the underlying mathematical computations. You'd normally use ```pandas``` to import the data into Python, since it allows you to easily manipulate data until you've obtained a clean, well-organized dataset that is ready for preprocessing. Once that is done, the dataset is ready for mathematical computations, which is when using ```NumPy``` to preprocess data makes sense.
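
A minimal sketch of that order of work, assuming a small made-up Series with one missing observation: pandas is used for the cleaning step, and the cleaned values are then handed to NumPy for the computation.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with one missing observation
raw = pd.Series([40.0, None, 50.0, 60.0, 35.0])

# pandas step: clean the data (here, simply drop the missing value)
clean = raw.dropna()

# NumPy step: hand the cleaned values over as a plain array
values = clean.to_numpy()
print(type(values))
print(values.mean())  # 46.25
```

Dropping the missing value is only one possible cleaning strategy; which one is appropriate depends on the dataset.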

### What problems does ```pandas``` solve?
- pandas makes the experience of doing Data Analysis in Python much faster and easier.
- it enhances analytical work if your dataset contains missing values or when you have to combine information across several datasets
- pandas has the ability to import data from and export data to an extensive set of file formats
- while NumPy is the library to opt for when it comes to numerical calculations, pandas has been specifically designed to help you if you have datasets containing multiple types of information.
- Preserves Data Consistency: The general rule for maintaining consistent data is that you must have data values of a single type stored in your Series object or in each column of your DataFrame.
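
The data consistency rule in the last point can be checked through the `dtype` of a Series. The sketch below uses made-up values; `pd.to_numeric` is shown as one possible fix, assuming every value can actually be parsed as a number.

```python
import pandas as pd

consistent = pd.Series([40, 45, 50])
mixed = pd.Series([40, "45", 50])  # one value is a string by mistake

print(consistent.dtype)  # int64
print(mixed.dtype)       # object

# One possible fix, assuming all values are parseable as numbers
fixed = pd.to_numeric(mixed)
print(fixed.dtype)       # an integer dtype again
```

An unexpected `object` dtype on a column that should be numeric is often the first visible sign that its values are not of a single type.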

## Installation, Updating and Importing
```pip install pandas```

To upgrade to the latest version of pandas:\
```pip install pandas --upgrade```

```pandas``` is usually imported under the ```pd``` alias.

In [1]:
import pandas as pd

In [2]:
#checking pandas version
pd.__version__

'2.3.2'

## Introduction to pandas Series

In [3]:
# We can create a 'Series' object from a list
products = ['A', 'B', 'C', 'D']
products

['A', 'B', 'C', 'D']

In [4]:
type(products)

list

In [5]:
new_products = pd.Series(products)

In [6]:
new_products

0    A
1    B
2    C
3    D
dtype: object

In [7]:
# We see that the 4 letters from the 'products' list have been organized into a column.
# The numbers displayed to the left of the letters are the index values of the Series.
# 'object' is the default dtype assigned to data that is not numeric.

In [8]:
type(new_products)

pandas.core.series.Series

In [9]:
# numeric data
daily_expenses = pd.Series([40, 45, 50, 60, 35])
daily_expenses

0    40
1    45
2    50
3    60
4    35
dtype: int64

In [10]:
print(daily_expenses)

0    40
1    45
2    50
3    60
4    35
dtype: int64


In [11]:
# pandas Series object corresponds to the 1-D NumPy array structure
import numpy as np

In [12]:
arr1 = np.array([10, 20, 30, 40, 50])
arr1

array([10, 20, 30, 40, 50])

In [13]:
type(arr1)

numpy.ndarray

In [14]:
series1 = pd.Series(arr1)
series1

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [15]:
type(series1)

pandas.core.series.Series

#### Key Takeaways:
1. The ```pandas``` Series object is something like a powerful version of the Python list, or an enhanced version of the NumPy array.\
This doesn't mean that 'Series' should be the preferred choice between the three no matter what.\
Since there's always a trade-off in terms of:

> *what you want to obtain from your data* **vs**\
> *the speed and precision with which you can do that*

However, if in a given situation you have opted for a 'Series', you will be working with a larger set of tools and capabilities that are specific to the ```pandas``` library. These tools and capabilities are often related to the fact that the 'Series' object stores its values in a sequenced order and has an *explicit index*.

2. Remember to always maintain ***data consistency***
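
To make the first takeaway more concrete, here is a small comparison of the three structures. The day labels used as an index are an assumption made for this sketch.

```python
import numpy as np
import pandas as pd

expenses_list = [40, 45, 50]
expenses_arr = np.array(expenses_list)
# Hypothetical day labels used as an explicit index
expenses = pd.Series(expenses_list, index=["Mon", "Tue", "Wed"])

print(expenses_list * 2)  # list repetition: [40, 45, 50, 40, 45, 50]
print(expenses_arr * 2)   # element-wise: [ 80  90 100]
print(expenses * 2)       # element-wise, with the labels kept

# The explicit index allows access by label
print(expenses["Tue"])    # 45
```

The list repeats its contents, the array multiplies each element, and the Series does the same while keeping each value attached to its label.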

## Working with Attributes in Python

In [16]:
series1

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [17]:
series1.dtype

dtype('int64')

In [18]:
series1.size

5

In [19]:
new_products

0    A
1    B
2    C
3    D
dtype: object

In [20]:
new_products.dtype

dtype('O')

In [21]:
new_products.size

4

#### Attributes of a Class Coming into Play
We already know that *attributes* deliver information about a given object, but how do they do that?\
They do that by returning a separate Python object. The benefit of this is that you can use the returned object in larger expressions in your code.\
For instance, if we pass the *size* of any Series as an argument to the *type* function, we will receive 'int' as the output, saying that the value is an integer.

In [22]:
type(new_products.size)

int

Therefore, the number of elements stored in the Series can be referred to in further calculations by using this short expression.\
Another peculiarity of attributes is that, at a given moment in time, they may contain no specific value. Here's an example:

In [23]:
new_products.name

In [24]:
print(new_products.name)

None


Something to keep in mind:
> 'new_products' - the name we can use to refer to that object in our code.\
> Object name - the one you want to see whenever displaying its contents.

In other words, you work with the *new_products* object in your program, but the name you want to associate with its data can be a different one.

In [25]:
new_products.name = "New Products"
new_products

0    A
1    B
2    C
3    D
Name: New Products, dtype: object

In [26]:
print(new_products)

0    A
1    B
2    C
3    D
Name: New Products, dtype: object


The attributes related to a certain Python object allow us to extract information about it.\
However, they are not meant to alter or modify the data it stores. Even assigning a new *name*, as we did above, changes only the label of the Series, not its values.
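
A short sketch of both points, reusing the daily expenses values from earlier: the object returned by an attribute can take part in a larger expression, and neither reading an attribute nor setting the name touches the stored values. The name "Daily Expenses" is chosen only for illustration.

```python
import pandas as pd

daily_expenses = pd.Series([40, 45, 50, 60, 35])

# .size returns a plain int, so it can be used in arithmetic
average = daily_expenses.sum() / daily_expenses.size
print(average)  # 46.0

# Setting the name changes the label only, not the stored values
daily_expenses.name = "Daily Expenses"
print(daily_expenses.tolist())  # [40, 45, 50, 60, 35]
```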

## Using an Index in pandas

In pandas, the index is a crucial component of both Series and DataFrame objects, serving as a set of labels that identify each element of a Series or each row of a DataFrame. It provides a mechanism for efficient data access, alignment, and identification.\
\
While explaining the use of an Index, we will work with a Series created from a *Dictionary* as opposed to a list or array.

In [27]:
prices_per_category = {'Product A': 22250, 'Product B': 16600, 'Product C': 15600}
prices_per_category

{'Product A': 22250, 'Product B': 16600, 'Product C': 15600}

In [28]:
type(prices_per_category)

dict

In [29]:
prices_per_category = pd.Series(prices_per_category)
prices_per_category

Product A    22250
Product B    16600
Product C    15600
dtype: int64

In [30]:
type(prices_per_category)

pandas.core.series.Series

We can see that the *keys* have been translated to *index values*. These index values act as access labels that explicitly identify the integer values we have on the right side.\
This is exactly what a Series index needs to do.\
It contains *Index Values* that point to the relevant *Data Points*, stored in a Series.

The Series *index* is a separate object.

In [31]:
prices_per_category.index

Index(['Product A', 'Product B', 'Product C'], dtype='object')

In [33]:
type(prices_per_category.index)

pandas.core.indexes.base.Index

Using indexes has many practical applications when working with pandas. A few crucial points to remember:
1. An *index* allows you to refer to a position within a sequence, or, in other words, a set of values in a sequenced order.
2. You will be able to quickly access the prices of the relevant categories through their respective *index values*.
3. The *index* data structure will often turn out to be a way to speed up your computations while working with large datasets.
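
The second point can be sketched with the same prices Series, rebuilt here so that the block runs on its own:

```python
import pandas as pd

prices_per_category = pd.Series(
    {'Product A': 22250, 'Product B': 16600, 'Product C': 15600}
)

# Access a single price through its index label
print(prices_per_category['Product B'])  # 16600

# Check whether a label exists in the index
print('Product C' in prices_per_category.index)  # True

# Passing a list of labels returns a smaller Series
print(prices_per_category[['Product A', 'Product C']])
```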

## Label-based vs Position-based Indexing

In pandas, you can also have *implicit indices*. 
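
A minimal sketch of the distinction, assuming that "implicit index" refers to the default integer positions pandas supplies when no labels are given. The `.loc` accessor selects by label and `.iloc` selects by position.

```python
import pandas as pd

# No index given: pandas supplies a default 0, 1, 2, ... index
new_products = pd.Series(['A', 'B', 'C', 'D'])
print(new_products.index)  # RangeIndex(start=0, stop=4, step=1)

# A Series with explicit labels still has integer positions
prices = pd.Series({'Product A': 22250, 'Product B': 16600, 'Product C': 15600})
print(prices.loc['Product B'])  # label-based: 16600
print(prices.iloc[1])           # position-based: 16600
```

Using `.loc` and `.iloc` explicitly makes it clear to the reader of the code which kind of lookup is intended.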

## More on working with Indices in Python

## Using Methods in Python

## Parameters vs Arguments