# Introduction to Pandas

Pandas is a package for data manipulation and analysis in Python. The name Pandas is derived from the econometrics term Panel Data. 

Pandas adds two data structures to Python: the Pandas Series and the Pandas DataFrame.

[Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/)

## Why Pandas?

Machine learning algorithms do not perform well on data with many missing or bad values. Therefore, an important step in machine learning is to look at your data first and, through some basic data analysis, make sure it is well suited to your training algorithm.

Pandas Series and DataFrames are designed for fast data analysis and manipulation, while remaining flexible and easy to use. Here are some of their features:
- Labels can be used for rows and columns
- Rolling statistics can be calculated on time series data
- NaN values are easy to handle
- Data in different formats can be loaded into DataFrames
- Different datasets can be joined and merged
- Integration with NumPy and Matplotlib
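
As a minimal sketch of two of these features (labels and NaN handling, plus a rolling statistic), consider the example below. The data and the names `readings`, `filled`, and `rolling_mean` are made up for illustration; replacing NaN with `0.0` is only one possible choice and is not always appropriate.

```python
import numpy as np
import pandas as pd

# Illustrative data: four labeled readings, one of them missing (hypothetical values).
readings = pd.Series(
    [1.0, np.nan, 3.0, 4.0],
    index=["mon", "tue", "wed", "thu"],
)

# Count the missing entries.
n_missing = int(readings.isna().sum())

# Replace NaN with 0.0, then compute a rolling mean over windows of 2 elements.
# The first element has no full window, so its rolling mean is NaN.
filled = readings.fillna(0.0)
rolling_mean = filled.rolling(window=2).mean()

print(n_missing)
print(rolling_mean)
```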


## Creating Pandas Series

In [4]:
import pandas as pd

In [35]:
my_series = pd.Series(data=[1, 20, True, "Banana"], index=["index1", "index2", "index3", "index4"])

In [36]:
my_series

index1         1
index2        20
index3      True
index4    Banana
dtype: object

In this way, we have created a Series whose dtype is `object` (because it mixes numbers, a string, and a bool). In the printed output, the first column lists the index labels and the second column lists the corresponding values.

Creating a Series is simple: we only need to provide the data and an index label for each element of the data.
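
The `object` dtype seen above comes from mixing types. As a small sketch (the variable names `ints`, `floats`, and `mixed` are only illustrative), the following shows how the dtype depends on the data passed in:

```python
import pandas as pd

# All integers: an integer dtype is inferred.
ints = pd.Series([1, 2, 3])

# Integers mixed with floats: everything is stored as float.
floats = pd.Series([1, 2.5])

# Numbers, a bool, and a string have no common numeric type,
# so Pandas falls back to the generic object dtype.
mixed = pd.Series([1, True, "Banana"])

print(ints.dtype, floats.dtype, mixed.dtype)
```

Operations on `object` Series are generally slower than on numeric ones, so homogeneous data is preferable when possible.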

We can also pass in predefined data and a predefined index, for instance data generated with NumPy and labels drawn at random.

In [37]:
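import random
import string

import numpy as np

num_indices = 10

# Values 0..9 as a NumPy array
my_list = np.arange(num_indices)
# One random letter or digit per value, used as index labels (labels may repeat)
my_indices = [random.choice(string.ascii_letters + string.digits) for _ in range(num_indices)]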

In [38]:
my_random_series = pd.Series(data=my_list, index=my_indices)
my_random_series

N    0
r    1
H    2
0    3
V    4
8    5
W    6
e    7
z    8
h    9
dtype: int64
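
Because the labels above are drawn at random, the same character can appear more than once. Pandas accepts duplicate index labels, but a lookup on a repeated label then returns several values. A minimal sketch with hand-picked labels (the names `s`, `dup`, and `single` are illustrative):

```python
import pandas as pd

# The label "a" is deliberately repeated.
s = pd.Series([0, 1, 2], index=["a", "b", "a"])

# is_unique reports whether every label occurs exactly once.
print(s.index.is_unique)

# A repeated label selects all matching entries (a Series)...
dup = s["a"]
# ...while a unique label selects a single value.
single = s["b"]

print(dup)
print(single)
```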

In the previous examples, the Series's values are indexed with strings/characters. The index parameter is optional; by default, the Series is created with an integer index starting at 0.

In [39]:
my_series_no_index = pd.Series(data=my_list)
my_series_no_index

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64
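
As a quick sketch of what this default index is, the example below inspects it and contrasts label-based access (`.loc`) with position-based access (`.iloc`). The names `s` and `relabeled` are illustrative.

```python
import pandas as pd

# No index given: Pandas builds a RangeIndex 0, 1, 2, ...
s = pd.Series([10, 20, 30])
print(s.index)

# With the default index, label 0 and position 0 refer to the same element.
first_by_label = s.loc[0]
first_by_position = s.iloc[0]

# With custom labels, positions still start at 0.
relabeled = pd.Series([10, 20, 30], index=["x", "y", "z"])
first_relabeled = relabeled.iloc[0]

print(first_by_label, first_by_position, first_relabeled)
```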

Just like NumPy ndarrays, Pandas Series provide several attributes that give information about the Series. In particular:

Get the shape of the Series

In [40]:
my_series.shape

(4,)

Get the number of dimensions of the Series

In [41]:
my_series.ndim

1

Get the total number of elements in the Series

In [42]:
my_series.size

4

Get only the Series's values

In [43]:
my_series.values

array([1, 20, True, 'Banana'], dtype=object)

Get the Series's indices

In [44]:
my_series.index

Index(['index1', 'index2', 'index3', 'index4'], dtype='object')
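
The attributes above are related to one another. The sketch below, on a small made-up Series `s`, shows how they fit together; `to_numpy()` is used here as an alternative to `.values` for obtaining the underlying NumPy array.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.5, 2.5, 3.5], index=["a", "b", "c"])

# A Series is one-dimensional, so its shape is a 1-tuple.
print(s.shape, s.ndim)

# size is the number of elements, the same as len(s) and shape[0].
print(s.size, len(s), s.shape[0])

# The values as a NumPy array and the labels as a plain list.
values = s.to_numpy()
labels = list(s.index)
print(values, labels)
```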

It is possible to directly access a value of the Series using its index label.

In [46]:
my_series.index4

'Banana'
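
Attribute-style access only works when the label is a valid Python identifier that does not clash with an existing Series attribute or method (such as `index` or `size`). A minimal sketch of the more general alternatives, using a made-up Series `s` whose last label contains a space:

```python
import pandas as pd

s = pd.Series(
    [1, 20, True, "Banana"],
    index=["index1", "index2", "index3", "my label"],
)

# Attribute access: fine for identifier-like labels.
by_attr = s.index1

# Bracket and .loc access work for any label, including "my label".
by_bracket = s["my label"]
by_loc = s.loc["my label"]

# .iloc selects by integer position instead of by label.
by_position = s.iloc[3]

print(by_attr, by_bracket, by_loc, by_position)
```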

You can check whether an index label exists with the `in` operator. Note that `in` applied to a Series tests the index labels, not the values.

In [52]:
'Banana' in my_series

False

In [54]:
'index4' in my_series

True

This operation can also be written explicitly against the index:

In [56]:
'Banana' in my_series.index

False

We can do the same check against the values of the Series:

In [57]:
'Banana' in my_series.values

True
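
To summarize the membership checks, the sketch below uses a small made-up Series `s` and also shows `isin`, which tests each value against a list of candidates and returns a boolean Series.

```python
import pandas as pd

s = pd.Series([10, 20], index=["a", "b"])

# `in` on the Series itself looks at the index labels.
label_found = "b" in s
value_as_label = 20 in s

# To test the values, check against .values explicitly.
value_found = 20 in s.values

# isin gives one boolean per element; any() reduces it to a single answer.
mask = s.isin([20, 30])
any_match = bool(mask.any())

print(label_found, value_as_label, value_found, any_match)
```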