# pandas - Python Data Analysis Library
## Introduction
**Pandas** is a powerful Python library for data manipulation and analysis. Documentation and tutorials are available on the [pandas website](https://pandas.pydata.org).  
It provides two main data structures: `Series` and `DataFrame`.

In this notebook, we will learn how to:
- Create and manipulate Series and DataFrames
- Load data from CSV files
- Perform basic statistical operations
- Filter and visualize data

In [2]:
import sys
!{sys.executable} -m pip install pandas



In [3]:
import pandas as pd
import numpy as np

## Creating a Series
A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index.

In [4]:
s = pd.Series([10, 20, 30, 40])
print(s)

0    10
1    20
2    30
3    40
dtype: int64


In [14]:
t = pd.Series([130, 120, 160, 140], index=['a', 'b', 'c', 'd'])
print(t)

a    130
b    120
c    160
d    140
dtype: int64
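Once a Series has custom labels, values can be read either by label or by position. The following is a minimal sketch that rebuilds the same `t` Series; the variable names (`by_label`, `by_position`, `subset`, `above_125`) are illustrative only.

```python
import pandas as pd

t = pd.Series([130, 120, 160, 140], index=['a', 'b', 'c', 'd'])

by_label = t.loc['b']        # look up by index label
by_position = t.iloc[1]      # look up by integer position (0-based)
subset = t[['a', 'c']]       # select several labels at once
above_125 = t[t > 125]       # boolean mask keeps matching values only

print(by_label, by_position)
print(subset)
print(above_125)
```

Using `.loc` for labels and `.iloc` for positions keeps the intent explicit, which matters when the labels themselves are integers.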


In [17]:
# Create a dict from a Series...
f = t.to_dict()
print(f)

# ...and vice versa
d = pd.Series(f)
print(d)

{'a': 130, 'b': 120, 'c': 160, 'd': 140}
a    130
b    120
c    160
d    140
dtype: int64


## Creating a DataFrame
A DataFrame is a two-dimensional, size-mutable data structure with named columns, each of which can hold a different value type. It can be thought of as a **dictionary of Series** that share the same index.

In [18]:
data = {
    'Name': ['Anna', 'Luca', 'Marco', 'Julia','Toni','Bepi', 'Nane'],
    'Age': [17, 18, 16, 17, 18, 19, 18],
    'Class': ['5A', '5B', '4A', '5A', '5C','5D', '5A']
}

df = pd.DataFrame(data)
df

    Name  Age Class
0   Anna   17    5A
1   Luca   18    5B
2  Marco   16    4A
3  Julia   17    5A
4   Toni   18    5C
5   Bepi   19    5D
6   Nane   18    5A
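To make the "dictionary of Series" idea concrete, here is a small sketch that builds a DataFrame from explicit Series objects. The names `names`, `ages`, `partial`, `small_df` and `aligned` are hypothetical and used only for illustration.

```python
import pandas as pd

names = pd.Series(['Anna', 'Luca', 'Marco'])
ages = pd.Series([17, 18, 16])

# Each dict key becomes a column; each Series supplies that column's values
small_df = pd.DataFrame({'Name': names, 'Age': ages})
print(small_df)

# Series are aligned on their index: a missing label becomes NaN
partial = pd.Series([17, 18], index=[0, 1])
aligned = pd.DataFrame({'Name': names, 'Age': partial})
print(aligned)
```

In the second frame the row with index 2 has no age, so pandas fills it with `NaN` and the column becomes floating point.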


In [20]:
# Retrieve the 'Age' column as a Series
age_series = df['Age']
print(age_series)
print(type(age_series))

0    17
1    18
2    16
3    17
4    18
5    19
6    18
Name: Age, dtype: int64
<class 'pandas.core.series.Series'>


In [19]:
print("First 5 rows of the DataFrame:")
print(df.head())

print("\nLast 5 rows of the DataFrame:")
print(df.tail())

First 5 rows of the DataFrame:
    Name  Age Class
0   Anna   17    5A
1   Luca   18    5B
2  Marco   16    4A
3  Julia   17    5A
4   Toni   18    5C

Last 5 rows of the DataFrame:
    Name  Age Class
2  Marco   16    4A
3  Julia   17    5A
4   Toni   18    5C
5   Bepi   19    5D
6   Nane   18    5A


## Basic DataFrame operations

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    7 non-null      object
 1   Age     7 non-null      int64 
 2   Class   7 non-null      object
dtypes: int64(1), object(2)
memory usage: 300.0+ bytes


The `df.describe()` method generates descriptive statistics for the DataFrame. It summarizes the central tendency, dispersion, and shape of the distribution of the numerical columns. Its output includes:

* count: The number of non-null values.
* mean: The average value.
* std: The standard deviation.
* min: The minimum value.
* 25%: The first quartile (25th percentile).
* 50%: The median (50th percentile).
* 75%: The third quartile (75th percentile).
* max: The maximum value.

In [24]:
df.describe()

             Age
count   7.000000
mean   17.571429
std     0.975900
min    16.000000
25%    17.000000
50%    18.000000
75%    18.000000
max    19.000000
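Each row of the `describe()` output corresponds to an individual Series method. The sketch below recomputes them one by one on the same ages; the `summary` dict is just an illustrative container, not something pandas provides.

```python
import pandas as pd

age = pd.Series([17, 18, 16, 17, 18, 19, 18], name='Age')

summary = {
    'count': age.count(),          # number of non-null values
    'mean': age.mean(),
    'std': age.std(),              # sample standard deviation (ddof=1)
    'min': age.min(),
    '25%': age.quantile(0.25),
    '50%': age.median(),           # same as age.quantile(0.5)
    '75%': age.quantile(0.75),
    'max': age.max(),
}

for name, value in summary.items():
    print(f"{name:>5}: {value}")
```

Note that `std()` divides by `n - 1` by default, which is why it matches the `std` row reported by `describe()`.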


By default, `df.describe()` shows descriptive statistics only for numerical columns. In `df`, 'Age' is the only numerical column, while 'Name' and 'Class' have object dtype (strings). Passing `include='all'` adds the non-numerical columns to the summary:

In [25]:
df.describe(include='all')

        Name        Age Class
count      7   7.000000     7
unique     7        NaN     5
top     Anna        NaN    5A
freq       1        NaN     3
mean     NaN  17.571429   NaN
std      NaN   0.975900   NaN
min      NaN  16.000000   NaN
25%      NaN  17.000000   NaN
50%      NaN  18.000000   NaN
75%      NaN  18.000000   NaN
max      NaN  19.000000   NaN
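If only the text columns are of interest, the `exclude` parameter of `describe()` can drop the numerical ones instead. This is a small sketch on the same data; `text_summary` is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Anna', 'Luca', 'Marco', 'Julia', 'Toni', 'Bepi', 'Nane'],
    'Age': [17, 18, 16, 17, 18, 19, 18],
    'Class': ['5A', '5B', '4A', '5A', '5C', '5D', '5A'],
})

# Exclude numerical columns: only 'Name' and 'Class' are summarized,
# with count, unique, top (most frequent value) and freq (its frequency)
text_summary = df.describe(exclude=[np.number])
print(text_summary)
```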


In [23]:
df['Age'].mean()

np.float64(17.571428571428573)

In [22]:
df['Class'].value_counts()

Class
5A    3
5B    1
4A    1
5C    1
5D    1
Name: count, dtype: int64
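`value_counts()` can also report relative frequencies, and its result can be re-sorted by label. A short sketch, assuming the same class values as above (`shares` and `alphabetical` are illustrative names):

```python
import pandas as pd

classes = pd.Series(['5A', '5B', '4A', '5A', '5C', '5D', '5A'], name='Class')

# normalize=True returns fractions of the total instead of raw counts
shares = classes.value_counts(normalize=True)
print(shares)

# sort_index() orders the result by class label rather than by count
alphabetical = classes.value_counts().sort_index()
print(alphabetical)
```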


## Filtering and selecting data

In [21]:
df[df['Age'] > 17]

   Name  Age Class
1  Luca   18    5B
4  Toni   18    5C
5  Bepi   19    5D
6  Nane   18    5A


In [26]:
df[df['Class'] == '5A']

    Name  Age Class
0   Anna   17    5A
3  Julia   17    5A
6   Nane   18    5A
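The two filters above can be combined. The sketch below shows one way to do so with `&` (and), `|` (or) and `isin`; the result names are illustrative. Each condition needs its own parentheses because `&` and `|` bind more tightly than comparisons in Python.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Anna', 'Luca', 'Marco', 'Julia', 'Toni', 'Bepi', 'Nane'],
    'Age': [17, 18, 16, 17, 18, 19, 18],
    'Class': ['5A', '5B', '4A', '5A', '5C', '5D', '5A'],
})

# Both conditions must hold
adults_in_5a = df[(df['Age'] >= 18) & (df['Class'] == '5A')]

# At least one condition must hold
young_or_5d = df[(df['Age'] < 17) | (df['Class'] == '5D')]

# Membership test against a list of values
in_5b_or_5c = df[df['Class'].isin(['5B', '5C'])]

print(adults_in_5a)
print(young_or_5d)
print(in_5b_or_5c)
```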


In [27]:
df.loc[0:2, ['Name', 'Class']]

    Name Class
0   Anna    5A
1   Luca    5B
2  Marco    4A


The statement `df.loc[0:2, ['Name', 'Class']]` is used to select specific rows and columns from your `df` DataFrame based on their labels.

* `.loc`: This is a label-based indexer for selecting data by row and column labels.
* `0:2`: This selects rows with labels from 0 up to and including 2. Unlike ordinary Python slices, label slices in `.loc` include the end label.
* `['Name', 'Class']`: This selects the columns with the labels 'Name' and 'Class'.

So, this code will return a new DataFrame containing the rows with index labels 0, 1, and 2, and only the 'Name' and 'Class' columns from those rows.
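The difference between label-based and position-based selection is easiest to see side by side. The following sketch contrasts `.loc` with `.iloc` on the same data; `by_label`, `by_position` and `by_name` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Anna', 'Luca', 'Marco', 'Julia', 'Toni', 'Bepi', 'Nane'],
    'Age': [17, 18, 16, 17, 18, 19, 18],
    'Class': ['5A', '5B', '4A', '5A', '5C', '5D', '5A'],
})

# .loc slices by label and includes the end label: rows 0, 1 and 2
by_label = df.loc[0:2, ['Name', 'Class']]

# .iloc slices by position and excludes the end position: rows 0 and 1
by_position = df.iloc[0:2, [0, 2]]

# With a non-numeric index, .loc slices between two labels (inclusive)
by_name = df.set_index('Name')
ages_luca_to_julia = by_name.loc['Luca':'Julia', 'Age']

print(by_label)
print(by_position)
print(ages_luca_to_julia)
```

With the default `RangeIndex`, labels and positions coincide, so the only visible difference is whether the end of the slice is included.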

## Exercises

1. Create a DataFrame with at least 5 students, including columns: Name, Surname, Age, Class.
2. Calculate the average age of the students.
3. Display how many students are in each class.
4. Filter only the students in class 5A.

Try to complete these tasks in new code cells below.

In [None]:
# Solve the exercises here