# 🐼 Guide to Using Pandas Methods and Functions
This guide provides a practical overview of the main Pandas methods and functions for data analysis. From DataFrame management to cleaning and transformation, you will find useful examples to manipulate and analyze data in Python efficiently.

By [Enzo Schitini](https://www.linkedin.com/in/enzoschitini/)

Data Scientist & Data Analyst • SQL • Expert Bubble.io • UX & UI @ Scituffy creator

Pandas is one of the most powerful and widely used libraries for manipulating and analyzing data in Python. Whether you are an aspiring data scientist, an experienced analyst, or simply someone who works with data, mastering Pandas can greatly improve your productivity and data processing skills. This guide aims to provide a comprehensive overview of Pandas' essential methods and functions, enabling you to tackle complex data operations with ease and efficiency.

In this guide, you will explore fundamental concepts such as data cleansing, transformation, aggregation, and visualization techniques using Pandas. Through practical examples and step-by-step instructions, you will gain a deeper understanding of how to leverage Pandas' full potential to simplify and enhance your data workflows.

### `Dataset:` For this guide, we will use data from an online store, although with fewer rows and columns than the original. We have 2022 rows and 10 columns.
Dataset link: https://github.com/enzoschitini/Guide-to-Using-Pandas/blob/main/pandas_csv_guide.csv

| order_id                            | customer_state | product_category_name | product_weight_g | review_score | price | freight_value | payment_value | order_approved_at     | order_purchase_timestamp |
|-------------------------------------|----------------|-----------------------|------------------|--------------|-------|---------------|---------------|-----------------------|-------------------------|
| 00010242fe8c5a6d1ba2dd792cb16214    | RJ             | cool_stuff            | 650.0            | 5            | 58.9  | 13.29         | 72.19         | 2017-09-13 09:45:35   | 2017-09-13 08:59:02     |
| 130898c0987d1801452a8ed92a670612    | GO             | cool_stuff            | 650.0            | 5            | 55.9  | 17.96         | 73.86         | 2017-06-29 02:44:11   | 2017-06-28 11:52:20     |
| 532ed5e14e24ae1f0d735b91524b98b9    | MG             | cool_stuff            | 650.0            | 4            | 64.9  | 18.33         | 83.23         | 2018-05-18 12:31:43   | 2018-05-18 10:25:53     |
| 6f8c31653edb8c83e1a739408b5ff750    | PR             | cool_stuff            | 650.0            | 5            | 58.9  | 16.17         | 75.07         | 2017-08-01 18:55:08   | 2017-08-01 18:38:42     |
| 7d19f4ef4d04461989632411b7e588b9    | MG             | cool_stuff            | 650.0            | 5            | 58.9  | 13.29         | 72.19         | 2017-08-10 22:05:11   | 2017-08-10 21:48:40     |

### Description of columns:

- **order_id**: Unique identifier for the order. Each row represents a specific order made by the customer.

- **customer_state**: Brazilian state where the customer resides. It is represented by a two-letter code (e.g., RJ for Rio de Janeiro).

- **product_category_name**: Category of the purchased product. For example, "cool_stuff" indicates a specific product category.

- **product_weight_g**: Weight of the product in grams. This provides information about the weight of the ordered product.

- **review_score**: Review score given by the customer for the order, typically on a scale from 1 to 5.

- **price**: Price of the product in the local currency (Brazilian reais). This indicates the cost of the purchased product.

- **freight_value**: Shipping cost in the local currency. This represents the shipping charge for the order.

- **payment_value**: Total amount paid for the order, including the product price and the shipping cost.

- **order_approved_at**: Date and time when the order was approved for shipping.

- **order_purchase_timestamp**: Date and time when the order was placed by the customer.

``` python
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/enzoschitini/Guide-to-Using-Pandas/refs/heads/main/pandas_csv_guide.csv').drop(columns='Unnamed: 0')

```

## Topics to work on within this guide:
I have chosen the 10 topics that, in my opinion, come up most often when analyzing data with Pandas.

### 1. **Loading Data**

- `pd.read_csv()` – Loads data from a CSV file.
- `pd.read_excel()` – Loads data from an Excel file.
- `pd.read_sql()` – Loads data from a SQL database.
- `pd.read_json()` – Loads data from a JSON file.
- `pd.read_parquet()` – Loads data from a Parquet file, useful for large datasets.
- `pd.read_html()` – Parses HTML tables from a webpage.
- `pd.read_pickle()` – Loads data saved in Python’s pickle format.
- `pd.read_feather()` – Loads data from a Feather-format file, suitable for fast input/output.
- `pd.read_sas()` – Loads data from SAS files.
- `pd.read_hdf()` – Loads data from HDF5-format files.

### 2. **Inspecting Data**

- `.head(n)` – Shows the first `n` rows of the DataFrame (default: 5).
- `.tail(n)` – Shows the last `n` rows of the DataFrame.
- `.shape` – Returns the dimensions (rows, columns) of the DataFrame.
- `.columns` – Lists the column names.
- `.info()` – Displays information about the DataFrame (column types, non-null counts).
- `.describe()` – Provides descriptive statistics for numeric columns.
- `.dtypes` – Returns data types of all columns.
- `.index` – Returns the index (row labels) of the DataFrame.
- `.value_counts()` – Counts unique values in a column.
- `.isnull()` / `.notnull()` – Checks for missing values.
- `.duplicated()` – Checks for duplicate rows.
- `.nunique()` – Counts the number of unique values per column.
- `.sample(n)` – Randomly selects `n` rows from the DataFrame.
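
As a quick illustration of the inspection helpers above that are not demonstrated later in the guide, here is a minimal sketch. The `orders` frame is hypothetical toy data shaped like the guide's dataset, not the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one duplicated row and one row with missing values
orders = pd.DataFrame({
    "order_id": ["a1", "b2", "b2", "c3"],
    "customer_state": ["RJ", "SP", "SP", None],
    "price": [58.9, 55.9, 55.9, np.nan],
})

missing_per_column = orders.isnull().sum()   # missing values in each column
duplicate_rows = orders.duplicated().sum()   # rows identical to an earlier row
unique_per_column = orders.nunique()         # distinct non-missing values per column
one_random_row = orders.sample(n=1, random_state=0)  # reproducible random pick

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print(unique_per_column)
```

Passing `random_state` to `.sample()` keeps the selection reproducible between runs.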

### 3. **Selecting and Indexing Data**

- `.loc[]` – Accesses groups of rows and columns by labels.
- `.iloc[]` – Accesses groups of rows and columns by position (integer-based).
- `.at[]` – Accesses a single value for a row/column label pair.
- `.iat[]` – Accesses a single value for a row/column position pair.
- `.filter()` – Subsets the DataFrame based on row/column labels.
- `.xs()` – Gets cross-sections from a MultiIndex.
- `.query()` – Filters the DataFrame using a string expression.
- `.get()` – Retrieves elements from a Series by key.
- `.isin()` – Filters rows based on whether values are in a list.
- `.where()` – Keeps values where a condition is `True` and replaces the rest (with `NaN` by default).
- `.mask()` – Replaces values where a condition is `True`.
- `.squeeze()` – Converts a DataFrame with a single column to a Series.
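
The selection methods listed above can be sketched on a small example. The `orders` frame and its order ids are hypothetical toy values that only mirror the guide's column names.

```python
import pandas as pd

# Hypothetical toy data; order ids are used as row labels
orders = pd.DataFrame(
    {
        "customer_state": ["RJ", "SP", "MG", "SP"],
        "price": [58.9, 55.9, 64.9, 10.0],
        "review_score": [5, 5, 4, 2],
    },
    index=["a1", "b2", "c3", "d4"],
)

by_label = orders.loc["b2", "price"]            # label-based lookup
by_position = orders.iloc[0, 1]                 # first row, second column
single_value = orders.at["c3", "review_score"]  # fast scalar access by label

# String-expression filter and membership filter
cheap_sp = orders.query("customer_state == 'SP' and price < 50")
rj_or_mg = orders[orders["customer_state"].isin(["RJ", "MG"])]

# where() keeps values that satisfy the condition; the rest become NaN
kept_if_expensive = orders["price"].where(orders["price"] > 50)
```

`.loc[]` and `.at[]` use labels, while `.iloc[]` and `.iat[]` use integer positions, so the two families give different results as soon as the index is not `0..n-1`.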

### 4. **Data Cleaning**

- `.drop()` – Removes specified labels from rows or columns.
- `.dropna()` – Removes rows/columns with missing values.
- `.fillna()` – Replaces missing values with a specified value.
- `.replace()` – Replaces values within the DataFrame.
- `.rename()` – Renames columns or indices.
- `.interpolate()` – Fills NaN values with interpolated values.
- `.bfill()` / `.ffill()` – Backward or forward fill of NaN values.
- `.convert_dtypes()` – Converts columns to optimal data types.
- `.clip()` – Limits values below or above a threshold.
- `.abs()` – Computes the absolute value of numeric columns.
- `.round(decimals)` – Rounds values to a specified number of decimals.
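
A minimal cleaning sketch that combines several of the methods above. The `raw` frame is hypothetical, and the choices made here (filling with the median, capping prices at 500) are illustrative assumptions, not rules from the guide.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values and one extreme price
raw = pd.DataFrame({
    "customer_state": ["RJ", "SP", None, "MG"],
    "price": [58.9, np.nan, 64.9, 1050.0],
    "freight_value": [13.29, 17.96, np.nan, 16.17],
})

cleaned = (
    raw.dropna(subset=["customer_state"])                # drop rows without a state
       .fillna({"price": raw["price"].median()})         # fill missing prices with the median
       .rename(columns={"freight_value": "shipping_cost"})
)

# Cap extreme prices (illustrative threshold) and round to one decimal
cleaned["price"] = cleaned["price"].clip(upper=500).round(1)
print(cleaned)
```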

### 5. **Data Transformation**

- `.astype()` – Changes data type of columns.
- `.apply()` – Applies a function along an axis (rows/columns).
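- `.applymap()` – Applies a function element-wise to a DataFrame (deprecated since pandas 2.1 in favor of `DataFrame.map()`).
- `.map()` – Maps each value of a Series through a dictionary, Series, or function.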
- `.sort_values()` – Sorts the DataFrame by columns.
- `.sort_index()` – Sorts the DataFrame by its index.
- `.reset_index()` – Resets the DataFrame’s index.
- `.pivot()` – Reshapes data based on column values.
- `.rank()` – Ranks values within each column.
- `.cumsum()` / `.cumprod()` – Computes cumulative sums/products.
- `.diff()` – Computes difference between subsequent rows.
- `.expanding()` – Applies expanding transformations (e.g., cumulative sum).
- `.pipe()` – Applies custom functions to the DataFrame.
- `.eval()` – Evaluates a string expression that operates on the DataFrame's columns (e.g., to compute a new column).
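
The transformation methods above can be sketched as follows. The data and the `state_names` lookup are hypothetical, introduced only for illustration.

```python
import pandas as pd

# Hypothetical toy data mirroring the guide's column names
orders = pd.DataFrame({
    "customer_state": ["RJ", "SP", "MG"],
    "price": [58.9, 55.9, 64.9],
    "freight_value": [13.29, 17.96, 18.33],
    "review_score": [5, 5, 4],
})
state_names = {"RJ": "Rio de Janeiro", "SP": "São Paulo", "MG": "Minas Gerais"}

orders = orders.astype({"review_score": "float64"})               # change a column's dtype
orders["state_name"] = orders["customer_state"].map(state_names)  # value-by-value lookup
orders = orders.eval("total = price + freight_value")             # new column from an expression

# Sort by the new column, renumber the rows, then add a running total
ranked = orders.sort_values("total", ascending=False).reset_index(drop=True)
ranked["running_total"] = ranked["total"].cumsum()
print(ranked)
```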

### 6. **Aggregation and Grouping**

- `.groupby()` – Groups the DataFrame based on one or more columns.
- `.agg()` – Applies aggregation functions like sum, mean, min, max on grouped data.
- `.sum()`, `.mean()`, `.min()`, `.max()`, `.count()` – Directly calculates these statistics.
- `.pivot_table()` – Creates a pivot table with specified rows, columns, and values.
- `.transform()` – Applies a function to each group and returns a result aligned with the original rows.
- `.size()` – Gets the size of each group.
- `.cumcount()` – Numbers the rows within each group, starting from 0.
- `.nsmallest(n, columns)` – Finds the `n` smallest values in a column.
- `.nlargest(n, columns)` – Finds the `n` largest values in a column.
- `.mad()` – Mean absolute deviation (deprecated in pandas 1.5 and removed in 2.0; compute it as `(s - s.mean()).abs().mean()` instead).
- `.rolling(window).apply()` – Applies a function on a rolling window.
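
A short sketch of grouping and aggregation, assuming hypothetical toy data. The output column names (`n_orders`, `mean_price`, `best_review`) are made up for this example.

```python
import pandas as pd

# Hypothetical toy data mirroring the guide's column names
orders = pd.DataFrame({
    "customer_state": ["SP", "SP", "RJ", "MG", "MG"],
    "price": [10.0, 30.0, 58.9, 60.0, 70.0],
    "review_score": [5, 3, 5, 4, 5],
})

# Named aggregation: one output column per (input column, function) pair
per_state = orders.groupby("customer_state").agg(
    n_orders=("price", "count"),
    mean_price=("price", "mean"),
    best_review=("review_score", "max"),
)

# transform() broadcasts the group result back onto the original rows
orders["state_mean_price"] = orders.groupby("customer_state")["price"].transform("mean")

top_two = orders.nlargest(2, "price")
print(per_state)
```

`.agg()` returns one row per group, whereas `.transform()` keeps the original number of rows, which makes it convenient for adding group-level features.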

### 7. **Merging and Combining Data**

- `pd.merge()` – Merges DataFrames on specified columns.
- `.join()` – Joins DataFrames on indices.
- `pd.concat()` – Concatenates DataFrames along rows or columns.
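
The three combining functions above, sketched on hypothetical toy frames. The `states` lookup table and the `XX` code are invented for illustration.

```python
import pandas as pd

# Hypothetical toy frames; "XX" has no match in the lookup table
orders = pd.DataFrame({
    "order_id": ["a1", "b2", "c3"],
    "customer_state": ["RJ", "SP", "XX"],
})
states = pd.DataFrame({
    "customer_state": ["RJ", "SP"],
    "state_name": ["Rio de Janeiro", "São Paulo"],
})

# Left merge keeps every order; unmatched states get NaN
merged = pd.merge(orders, states, on="customer_state", how="left")

# join() matches a column of the left frame against the right frame's index
joined = orders.join(states.set_index("customer_state"), on="customer_state")

# concat() stacks frames vertically; ignore_index renumbers the rows
stacked = pd.concat([orders, orders], ignore_index=True)
```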

### 8. **Exploring Temporal Data**

- `.resample()` – Groups and summarizes data based on a temporal frequency.
- `pd.to_datetime()` – Converts strings (or other values) to datetime objects.
- `.dt` accessor – Accesses date components like year, month, day.
- `.rolling()` – Applies operations on a temporal rolling window.
- `.shift()` – Shifts data over time (e.g., periods).
- `.diff()` – Computes the difference of successive values in time series.
- `.asfreq()` – Changes the frequency of a time series index.
- `.between_time()` – Extracts rows based on a specific time range.
- `.at_time()` – Extracts rows for a specific time.
- `.truncate()` – Trims rows before or after a specific date.
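
A minimal time-series sketch using some of the methods above. The timestamps and prices are hypothetical toy values in the same format as `order_purchase_timestamp`.

```python
import pandas as pd

# Hypothetical toy data with timestamps stored as strings
orders = pd.DataFrame({
    "order_purchase_timestamp": [
        "2017-09-13 08:59:02",
        "2017-09-20 11:52:20",
        "2017-10-05 10:25:53",
    ],
    "price": [58.9, 55.9, 64.9],
})

# Parse the strings, then read date parts through the .dt accessor
orders["order_purchase_timestamp"] = pd.to_datetime(orders["order_purchase_timestamp"])
orders["month"] = orders["order_purchase_timestamp"].dt.month

# Time-based methods need a DatetimeIndex
indexed = orders.set_index("order_purchase_timestamp")
monthly_revenue = indexed["price"].resample("MS").sum()  # "MS" = month-start bins
previous_price = indexed["price"].shift(1)               # value from the previous row
morning = indexed.between_time("08:00", "10:59")         # filter by time of day
print(monthly_revenue)
```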

### 9. **Exporting Data**

- `.to_csv()` – Exports data to a CSV file.
- `.to_excel()` – Exports data to an Excel file.
- `.to_sql()` – Exports data to a SQL database.
- `.to_json()` – Exports data in JSON format.
- `.to_parquet()` – Exports data in Parquet format.
- `.to_pickle()` – Exports data to a Python pickle file.
- `.to_html()` – Exports data to an HTML table.
- `.to_latex()` – Exports data in LaTeX format.
- `.to_dict()` – Converts data to a Python dictionary.
- `.to_markdown()` – Exports data in Markdown format.
- `.to_clipboard()` – Copies data to the clipboard.
- `.to_string()` – Converts the DataFrame to a string.
- `.to_records()` – Converts the DataFrame to an array of records.
- `.to_feather()` – Exports data in Feather format.

### 10. **Handling Multi-Level Indices (MultiIndex)**

- `.set_index()` – Sets one or more columns as the DataFrame’s index.
- `.reset_index()` – Resets the DataFrame index, moving current indices to columns.
- `.sort_index()` – Sorts the DataFrame by index values.
- `.swaplevel()` – Swaps levels of a MultiIndex.
- `.stack()` – Compresses column levels into rows.
- `.unstack()` – Expands row levels into columns.
- `.reorder_levels()` – Reorders levels of a MultiIndex.
- `.index.get_level_values()` – Extracts values of a specific level from a MultiIndex.
- `.droplevel()` – Removes a level from a MultiIndex.
- `.groupby(level=...)` – Groups data based on MultiIndex levels.
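
The MultiIndex methods above can be sketched on a two-level index built from hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data mirroring the guide's column names
orders = pd.DataFrame({
    "customer_state": ["SP", "SP", "RJ", "RJ"],
    "product_category_name": ["perfumaria", "cool_stuff", "perfumaria", "cool_stuff"],
    "price": [56.99, 58.9, 60.0, 55.9],
})

# Two columns become a two-level row index
indexed = orders.set_index(["customer_state", "product_category_name"]).sort_index()

sp_rows = indexed.xs("SP", level="customer_state")          # cross-section for one state
wide = indexed["price"].unstack("product_category_name")    # inner level becomes columns
mean_by_state = indexed.groupby(level="customer_state")["price"].mean()
flat = indexed.reset_index()                                 # back to ordinary columns
print(wide)
```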

## 🔥Let's get started!

``` python
import pandas as pd
```

## 1. Loading Data

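``` python
def url_github(file_type):
    """Build the raw GitHub URL of the guide's sample file for a given extension."""
    # The parameter is named file_type to avoid shadowing the built-in `type`.
    url_data = f'https://raw.githubusercontent.com/enzoschitini/Guide-to-Using-Pandas/refs/heads/main/Data/pandas_{file_type}_guide.{file_type}'
    return url_data
```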

### Import CSV, Excel, Parquet, Feather, JSON

``` python
# The CSV and Excel files carry the old index as an 'Unnamed: 0' column, so it is dropped.
csv = pd.read_csv(url_github('csv')).drop(columns='Unnamed: 0')
excel = pd.read_excel(url_github('xlsx')).drop(columns='Unnamed: 0')
parquet = pd.read_parquet(url_github('parquet'))
feather = pd.read_feather(url_github('feather'))
json = pd.read_json(url_github('json'))
```

### Import SQL

``` python
import requests
import sqlite3

# Download the database file from the link
url = 'https://raw.githubusercontent.com/enzoschitini/Guide-to-Using-Pandas/refs/heads/main/Data/pandas_sql_guide.db'
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed
with open('pandas_sql_guide.db', 'wb') as f:
    f.write(response.content)

# Create a connection to the local SQLite database
conn = sqlite3.connect('pandas_sql_guide.db')

# Read the SQL table into a pandas DataFrame
df_importato = pd.read_sql('SELECT * FROM pandas_sql_guide', conn).drop(columns='Unnamed: 0')

# Close the connection
conn.close()

df_importato.head()
```

|   | order_id | customer_state | product_category_name | product_weight_g | review_score | price | freight_value | payment_value | order_approved_at | order_purchase_timestamp |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00010242fe8c5a6d1ba2dd792cb16214 | RJ | cool_stuff | 650.0 | 5 | 58.9 | 13.29 | 72.19 | 2017-09-13 09:45:35 | 2017-09-13 08:59:02 |
| 1 | 130898c0987d1801452a8ed92a670612 | GO | cool_stuff | 650.0 | 5 | 55.9 | 17.96 | 73.86 | 2017-06-29 02:44:11 | 2017-06-28 11:52:20 |
| 2 | 532ed5e14e24ae1f0d735b91524b98b9 | MG | cool_stuff | 650.0 | 4 | 64.9 | 18.33 | 83.23 | 2018-05-18 12:31:43 | 2018-05-18 10:25:53 |
| 3 | 6f8c31653edb8c83e1a739408b5ff750 | PR | cool_stuff | 650.0 | 5 | 58.9 | 16.17 | 75.07 | 2017-08-01 18:55:08 | 2017-08-01 18:38:42 |
| 4 | 7d19f4ef4d04461989632411b7e588b9 | MG | cool_stuff | 650.0 | 5 | 58.9 | 13.29 | 72.19 | 2017-08-10 22:05:11 | 2017-08-10 21:48:40 |


For reference, a SQLite file like the one read above can be created from the CSV with `.to_sql()`:

``` python
import sqlite3  # or use other database connectors like SQLAlchemy for different databases

# Load your CSV file into a DataFrame
df = pd.read_csv('https://raw.githubusercontent.com/enzoschitini/Guide-to-Using-Pandas/refs/heads/main/Data/pandas_csv_guide.csv')

# Create a connection to a SQLite database (or another database)
conn = sqlite3.connect('pandas_sql_guide.db')  # Creates a database file if it doesn't exist

# Save the DataFrame to the SQL database
df.to_sql('pandas_sql_guide', conn, if_exists='replace', index=False)

# Close the connection
conn.close()
```

### Import HTML

``` shell
pip install lxml
```

``` python
# read_html() returns one DataFrame per HTML table found on the page
list_of_dfs = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita')
list_of_dfs[1]
```

|   | Country/Territory | IMF[4][5] Estimate | IMF[4][5] Year | World Bank[6] Estimate | World Bank[6] Year | United Nations[7] Estimate | United Nations[7] Year |
|---|---|---|---|---|---|---|---|
| 0 | Monaco | — | — | 240862 | 2022 | 240535 | 2022 |
| 1 | Liechtenstein | — | — | 187267 | 2022 | 197268 | 2022 |
| 2 | Luxembourg | 135321 | 2024 | 128259 | 2023 | 125897 | 2022 |
| 3 | Bermuda | — | — | 123091 | 2022 | 117568 | 2022 |
| 4 | Switzerland | 106098 | 2024 | 99995 | 2023 | 93636 | 2022 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 218 | Yemen | 465 | 2024 | 533 | 2023 | 327 | 2022 |
| 219 | Malawi | 464 | 2024 | 673 | 2023 | 615 | 2022 |
| 220 | Afghanistan | 411 | 2023 | 353 | 2022 | 345 | 2022 |
| 221 | South Sudan | 341 | 2024 | 1072 | 2015 | 423 | 2022 |


### Create a Dataset

``` python
base_url = "https://raw.githubusercontent.com/enzoschitini/Guide-to-Using-Pandas/refs/heads/main/Data/UCI%20HAR%20Dataset"

filename_features = f"{base_url}/features.txt"
filename_labels = f"{base_url}/activity_labels.txt"

filename_subtrain = f"{base_url}/train/subject_train.txt"
filename_xtrain = f"{base_url}/train/X_train.txt"
filename_ytrain = f"{base_url}/train/y_train.txt"
```

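``` python
# sep="#" is not expected to occur in features.txt, so each line is read whole into one column
features = pd.read_csv(filename_features, header=None, sep="#")
features.columns = ['nome_var']
# sep=r"\s+" splits on any whitespace (delim_whitespace=True is deprecated since pandas 2.2)
labels = pd.read_csv(filename_labels, sep=r"\s+", header=None, names=['cod_label', 'label'])
```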

``` python
# Read the training split: subject ids, the feature matrix, and the activity codes.
subject_train = pd.read_csv(filename_subtrain, header=None, names=['subject_id'])
# X_train.txt is whitespace-separated; its column names come from features.txt.
X_train = pd.read_csv(filename_xtrain, sep=r"\s+", header=None, names=features['nome_var'].tolist())
y_train = pd.read_csv(filename_ytrain, header=None, names=['cod_label'])
```

``` python
# X_train.head()
y_train.head()
```

|   | cod_label |
|---|---|
| 0 | 5 |
| 1 | 5 |
| 2 | 5 |
| 3 | 5 |
| 4 | 5 |


## 2. Inspecting Data

### ``.head(n)`` – Shows the first n rows of the DataFrame (default: 5)

``` python
csv.head(n=5)
```

|   | order_id | customer_state | product_category_name | product_weight_g | review_score | price | freight_value | payment_value | order_approved_at | order_purchase_timestamp |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00010242fe8c5a6d1ba2dd792cb16214 | RJ | cool_stuff | 650.0 | 5 | 58.9 | 13.29 | 72.19 | 2017-09-13 09:45:35 | 2017-09-13 08:59:02 |
| 1 | 130898c0987d1801452a8ed92a670612 | GO | cool_stuff | 650.0 | 5 | 55.9 | 17.96 | 73.86 | 2017-06-29 02:44:11 | 2017-06-28 11:52:20 |
| 2 | 532ed5e14e24ae1f0d735b91524b98b9 | MG | cool_stuff | 650.0 | 4 | 64.9 | 18.33 | 83.23 | 2018-05-18 12:31:43 | 2018-05-18 10:25:53 |
| 3 | 6f8c31653edb8c83e1a739408b5ff750 | PR | cool_stuff | 650.0 | 5 | 58.9 | 16.17 | 75.07 | 2017-08-01 18:55:08 | 2017-08-01 18:38:42 |
| 4 | 7d19f4ef4d04461989632411b7e588b9 | MG | cool_stuff | 650.0 | 5 | 58.9 | 13.29 | 72.19 | 2017-08-10 22:05:11 | 2017-08-10 21:48:40 |


### ``.tail(n)`` – Shows the last n rows of the DataFrame

``` python
csv.tail(n=5)
```

|   | order_id | customer_state | product_category_name | product_weight_g | review_score | price | freight_value | payment_value | order_approved_at | order_purchase_timestamp |
|---|---|---|---|---|---|---|---|---|---|---|
| 2017 | bb0c66e312ff8cb97698f012cd92553c | SP | perfumaria | 350.0 | 5 | 56.99 | 8.72 | 65.71 | 2017-11-22 02:56:28 | 2017-11-19 17:05:09 |
| 2018 | c0db7d31ace61fc360a3eaa34dd3457c | SP | perfumaria | 350.0 | 5 | 56.99 | 8.72 | 65.71 | 2018-02-13 16:50:30 | 2018-02-13 16:36:56 |
| 2019 | c0db7d31ace61fc360a3eaa34dd3457c | SP | perfumaria | 350.0 | 5 | 56.99 | 8.72 | 65.71 | 2018-02-13 16:50:30 | 2018-02-13 16:36:56 |
| 2020 | c90025afa3c59ad0768b713161777935 | SP | perfumaria | 350.0 | 5 | 56.99 | 8.72 | 65.71 | 2018-03-01 02:50:46 | 2018-02-28 12:59:08 |
| 2021 | cc3336764b2bc18f4eaa8f17f86bfd53 | SP | perfumaria | 350.0 | 5 | 56.99 | 7.78 | 64.77 | 2017-06-11 17:55:17 | 2017-06-11 17:43:18 |


### ``.shape`` – Returns the dimensions (rows, columns) of the DataFrame

``` python
csv.shape
```

```
(2022, 10)
```

### ``.columns`` – Lists the column names

``` python
csv.columns
```

```
Index(['order_id', 'customer_state', 'product_category_name',
       'product_weight_g', 'review_score', 'price', 'freight_value',
       'payment_value', 'order_approved_at', 'order_purchase_timestamp'],
      dtype='object')
```

``` python
list_of_dfs[1].columns
```

```
MultiIndex([('Country/Territory', 'Country/Territory'),
            (        'IMF[4][5]',          'Estimate'),
            (        'IMF[4][5]',              'Year'),
            (    'World Bank[6]',          'Estimate'),
            (    'World Bank[6]',              'Year'),
            ('United Nations[7]',          'Estimate'),
            ('United Nations[7]',              'Year')],
           )
```

### ``.info()`` – Displays information about the DataFrame (column types, non-null counts)

``` python
csv.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2022 entries, 0 to 2021
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   order_id                  2022 non-null   object 
 1   customer_state            2022 non-null   object 
 2   product_category_name     2022 non-null   object 
 3   product_weight_g          2022 non-null   float64
 4   review_score              2022 non-null   int64  
 5   price                     2022 non-null   float64
 6   freight_value             2022 non-null   float64
 7   payment_value             2022 non-null   float64
 8   order_approved_at         2022 non-null   object 
 9   order_purchase_timestamp  2022 non-null   object 
dtypes: float64(4), int64(1), object(5)
memory usage: 158.1+ KB
```


The `info()` method of a Pandas DataFrame prints a concise summary of its contents, which is useful for a first look at a dataset. Its main parameters are:

- `verbose`: (default `None`) If `True`, prints the full per-column summary; if `False`, prints a shortened view (useful for DataFrames with many columns).
- `buf`: (default `None`) A writable buffer, such as an open file, that receives the output. If `None`, the output is printed to the console.
- `max_cols`: (default `None`) The number of columns above which `info()` switches from the full summary to the shortened view.
- `memory_usage`: (default `None`, which follows the `display.memory_usage` option and normally shows memory usage) Controls whether the DataFrame's memory usage is shown. Set it to `'deep'` for a more accurate figure that also measures the contents of `object` columns.
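
The `buf` parameter is the least obvious of these, so here is a minimal sketch that captures the summary as a string instead of printing it. The `orders` frame is hypothetical toy data.

```python
import io

import pandas as pd

# Hypothetical toy data
orders = pd.DataFrame({"order_id": ["a1", "b2"], "price": [58.9, 55.9]})

# info() writes into the buffer instead of the console
buffer = io.StringIO()
orders.info(buf=buffer, memory_usage="deep")
summary = buffer.getvalue()

print(summary)
```

Capturing the text this way is handy when the summary should go into a log file or a report rather than the notebook output.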

``` python
csv.info(verbose=True, memory_usage='deep')
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2022 entries, 0 to 2021
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   order_id                  2022 non-null   object 
 1   customer_state            2022 non-null   object 
 2   product_category_name     2022 non-null   object 
 3   product_weight_g          2022 non-null   float64
 4   review_score              2022 non-null   int64  
 5   price                     2022 non-null   float64
 6   freight_value             2022 non-null   float64
 7   payment_value             2022 non-null   float64
 8   order_approved_at         2022 non-null   object 
 9   order_purchase_timestamp  2022 non-null   object 
dtypes: float64(4), int64(1), object(5)
memory usage: 727.2 KB
```


### ``.describe()`` – Provides descriptive statistics for numeric columns

``` python
csv.describe()
```

|   | product_weight_g | review_score | price | freight_value | payment_value |
|---|---|---|---|---|---|
| count | 2022.0 | 2022.0 | 2022.0 | 2022.0 | 2022.0 |
| mean | 1154.325915 | 4.01731 | 73.512992 | 15.884327 | 100.126632 |
| std | 2643.718753 | 1.381362 | 63.045769 | 9.318311 | 111.499859 |
| min | 50.0 | 1.0 | 2.99 | 0.0 | 0.22 |
| 25% | 203.75 | 3.0 | 51.9 | 11.68 | 65.71 |
| 50% | 350.0 | 5.0 | 58.99 | 15.15 | 76.65 |
| 75% | 950.0 | 5.0 | 84.99 | 17.68 | 106.38 |
| max | 30000.0 | 5.0 | 1050.0 | 185.73 | 1525.78 |


The `describe()` method generates descriptive statistics for a DataFrame or a Series. By default it returns statistics for numeric columns only, but it can also summarize categorical data. Here is an overview of the main parameters:

### Main Parameters of `describe()`

1. **`percentiles`**: The percentiles to compute. By default these are the 25th, the 50th (median), and the 75th. Pass a list of fractions to get custom values.
    - *Type*: array-like of numbers between 0 and 1.
    - *Default*: `[0.25, 0.5, 0.75]`.
2. **`include`**: Which data types to include in the description. It can be `None` (the default, numeric columns only), `'all'` (every data type), or a list of data types such as `['object']` or `['number']`.
    - *Default*: `None`.
3. **`exclude`**: Which data types to leave out of the analysis. It works as the complement of `include`.
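
To illustrate how `include` and `exclude` complement each other, here is a small sketch on hypothetical toy data. It selects by the `'number'` type group so that the text column falls on the other side of the split.

```python
import pandas as pd

# Hypothetical toy data: one text column, one numeric column
orders = pd.DataFrame({
    "customer_state": ["SP", "SP", "RJ"],
    "price": [10.0, 20.0, 30.0],
})

# Only numeric columns, with the median as the single percentile
numeric_summary = orders.describe(include=["number"], percentiles=[0.5])

# Everything except numeric columns: count, unique, top, freq
text_summary = orders.describe(exclude=["number"])

print(numeric_summary)
print(text_summary)
```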

``` python
csv.describe(percentiles=[0.1, 0.9])
```

|   | product_weight_g | review_score | price | freight_value | payment_value |
|---|---|---|---|---|---|
| count | 2022.0 | 2022.0 | 2022.0 | 2022.0 | 2022.0 |
| mean | 1154.325915 | 4.01731 | 73.512992 | 15.884327 | 100.126632 |
| std | 2643.718753 | 1.381362 | 63.045769 | 9.318311 | 111.499859 |
| min | 50.0 | 1.0 | 2.99 | 0.0 | 0.22 |
| 10% | 200.0 | 1.0 | 21.0 | 8.72 | 29.56 |
| 50% | 350.0 | 5.0 | 58.99 | 15.15 | 76.65 |
| 90% | 2245.0 | 5.0 | 118.9 | 23.161 | 162.07 |
| max | 30000.0 | 5.0 | 1050.0 | 185.73 | 1525.78 |


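``` python
csv.describe(include='all')
```

|   | order_id | customer_state | product_category_name | product_weight_g | review_score | price | freight_value | payment_value | order_approved_at | order_purchase_timestamp |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 2022 | 2022 | 2022 | 2022.0 | 2022.0 | 2022.0 | 2022.0 | 2022.0 | 2022 | 2022 |
| unique | 1735 | 27 | 21 | NaN | NaN | NaN | NaN | NaN | 1734 | 1735 |
| top | 370e2e6c1a9fd451eb7f0852daa3b006 | SP | beleza_saude | NaN | NaN | NaN | NaN | NaN | 2017-03-11 18:34:44 | 2017-03-11 18:34:44 |
| freq | 11 | 849 | 766 | NaN | NaN | NaN | NaN | NaN | 11 | 11 |
| mean | NaN | NaN | NaN | 1154.325915 | 4.01731 | 73.512992 | 15.884327 | 100.126632 | NaN | NaN |
| std | NaN | NaN | NaN | 2643.718753 | 1.381362 | 63.045769 | 9.318311 | 111.499859 | NaN | NaN |
| min | NaN | NaN | NaN | 50.0 | 1.0 | 2.99 | 0.0 | 0.22 | NaN | NaN |
| 25% | NaN | NaN | NaN | 203.75 | 3.0 | 51.9 | 11.68 | 65.71 | NaN | NaN |
| 50% | NaN | NaN | NaN | 350.0 | 5.0 | 58.99 | 15.15 | 76.65 | NaN | NaN |
| 75% | NaN | NaN | NaN | 950.0 | 5.0 | 84.99 | 17.68 | 106.38 | NaN | NaN |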


### ``.dtypes`` – Returns data types of all columns

``` python
csv.dtypes
```

```
order_id                     object
customer_state               object
product_category_name        object
product_weight_g            float64
review_score                  int64
price                       float64
freight_value               float64
payment_value               float64
order_approved_at            object
order_purchase_timestamp     object
dtype: object
```

### ``.index`` – Returns the index (row labels) of the DataFrame

``` python
csv.index
```

```
RangeIndex(start=0, stop=2022, step=1)
```

### ``.value_counts()`` – Counts unique values in a column

``` python
csv.value_counts()
```

```
order_id                          customer_state  product_category_name   product_weight_g  review_score  price  freight_value  payment_value  order_approved_at    order_purchase_timestamp
2f839b79d9954ebfedeeba654f0f3de8  SP              telefonia               150.0             5             7.00   7.39           71.95          2018-03-26 14:50:21  2018-03-26 14:40:10         5
8a8bd4a338e17ace44431e99a2add1d2  DF              dvds_blu_ray            2150.0            5             83.99  18.17          20.00          2018-05-15 21:54:02  2018-05-15 21:31:55         5
58346246ea802a21cb34124ed2326770  SP              perfumaria              200.0             5             44.99  7.58           210.28         2018-07-11 17:25:53  2018-07-11 17:17:37         4
05fcd933547be81890bc4d62357fdf3f  SP              informatica_acessorios  300.0             1             89.90  12.13          408.12         2017-07-19 10:30:13  2017-07-19 10:17:34         4
...
```

The `.value_counts()` method returns how often each unique value occurs in a Series (or in a DataFrame column). It is very useful for exploring categorical data and for understanding how the values in a column are distributed. Called on a whole DataFrame with no arguments, it counts each unique combination of values across all columns.

### Main Parameters of `.value_counts()`

1. **`normalize`**: If `True`, returns relative frequencies instead of absolute counts.
    - *Type*: boolean (`True` or `False`).
    - *Default*: `False`.
2. **`sort`**: Whether to sort the results by count (in descending order). If `False`, the values are not sorted.
    - *Type*: boolean (`True` or `False`).
    - *Default*: `True`.
3. **`ascending`**: If `True`, sorts the results in ascending order.
    - *Type*: boolean (`True` or `False`).
    - *Default*: `False`.
4. **`bins`**: Groups numeric data into intervals (binning) before counting. Available on a Series only.
    - *Type*: integer.
    - *Default*: `None`.
5. **`dropna`**: If `False`, includes `NaN` values in the count.
    - *Type*: boolean (`True` or `False`).
    - *Default*: `True`.

``` python
csv.value_counts('customer_state', normalize=True).head(5)
```

```
customer_state
SP    0.419881
MG    0.124135
RJ    0.112760
RS    0.048467
BA    0.044510
Name: proportion, dtype: float64
```

``` python
csv.value_counts('customer_state', ascending=True).head()
```

```
customer_state
AC    1
RR    2
AP    3
TO    3
SE    5
Name: count, dtype: int64
```

``` python
csv.value_counts('customer_state', dropna=False).head()
```

```
customer_state
SP    849
MG    251
RJ    228
RS     98
BA     90
Name: count, dtype: int64
```

``` python
csv['price'].value_counts(bins=3)
```

```
(1.9420000000000002, 351.993]    2007
(351.993, 700.997]                 10
(700.997, 1050.0]                   5
Name: count, dtype: int64
```

## 3. Selecting and Indexing Data

## 4. Data Cleaning

## 5. Data Transformation

## 6. Aggregation and Grouping

## 7. Merging and Combining Data

## 8. Exploring Temporal Data

## 9. Exporting Data

## 10. Handling Multi-Level Indices (MultiIndex)

## 11. Pandas Profiling

## 12. Method chaining

## Web Scraping [COLLECTION]

##

<p align="center">
  Enzo Schitini
</p>

<p align="center">
  Data Scientist & Data Analyst • SQL • Expert Bubble.io • UX & UI @ Scituffy creator
</p>