<font size="6">**Reading tabular files in Python**</font><br>

> (c) 2025 Antonio Piemontese

<p style="color:red; font-size:18px; font-weight:bold;">
🚨 This notebook is a cell-by-cell English translation of the notebook 'Leggere file tabellari in Python' 🚨
</p>

# Data in Python (in Data Science)
In Data Science we are often interested in:
- analysing **past data**, not in real time
- analysing a **single file**, not the DB

These data are usually **tabular files**, so called because they are <u>made of rows and columns</u> (2 dimensions). They are usually **created by users**, received **from other companies**, or **simply exported from a DB** (as an export).

One of the most widespread tabular formats for importing data into Python is the [**CSV**](https://en.wikipedia.org/wiki/Comma-separated_values) format. It is practically **ubiquitous in all tabular data-management tools and environments**: Excel, Google Sheets, all relational DBs, etc.

Even though in Python you can load **any kind of file** (xml, json, PDF, txt, etc.) and access **any kind of database** (Oracle, PostgreSQL, MySQL, etc.), the simplest and most efficient format to load in memory (into a `pandas` dataframe) is CSV.

```python
import pandas as pd
df = pd.read_csv('data.csv')
df = pd.read_excel('data.xlsx')
```

If you want **more performant** alternatives, nowadays `cuDF` (with GPU) or `Polars` are used a lot.

<p style="color:red; font-size:18px; font-weight:bold;">
🚨 Tabular files (and also the SQL tables we will see later) are usually loaded into a pandas dataframe 🚨
</p>
Pandas is convenient but it is not the only way to import tabular files in Python.<br>


> “A CSV or Excel file (**a *tabular* file**) may look like a database table, but it is only a container of raw data.
> An SQL table, instead, is a controlled structure, with rules, types and relations, which the database manages in a consistent, safe and transactional way.”

# Reading tabular files through specific libraries

To read <u>csv</u> tabular files a **first option** is the **`csv` library**, i.e. Python’s **built-in CSV module**.<br>
The reason for this choice, which has many limitations, is to avoid loading the heavy *pandas* package and the dependency problems that may come with it.

In [1]:
import csv

with open("Credit_ISLR.csv", newline="") as f:
    reader = csv.reader(f)
    for i, row in enumerate(reader):
        print(row)
        if i >= 19:   # stop after the first 20 rows (indices 0–19)
            break

['', 'ID', 'Income', 'Limit', 'Rating', 'Cards', 'Age', 'Education', 'Gender', 'Student', 'Married', 'Ethnicity', 'Balance']
['1', '1', '14.891', '3606', '283', '2', '34', '11', ' Male', 'No', 'Yes', 'Caucasian', '333']
['2', '2', '106.025', '6645', '483', '3', '82', '15', 'Female', 'Yes', 'Yes', 'Asian', '903']
['3', '3', '104.593', '7075', '514', '4', '71', '11', ' Male', 'No', 'No', 'Asian', '580']
['4', '4', '148.924', '9504', '681', '3', '36', '11', 'Female', 'No', 'No', 'Asian', '964']
['5', '5', '55.882', '4897', '357', '2', '68', '16', ' Male', 'No', 'Yes', 'Caucasian', '331']
['6', '6', '80.18', '8047', '569', '4', '77', '10', ' Male', 'No', 'No', 'Caucasian', '1151']
['7', '7', '20.996', '3388', '259', '2', '37', '12', 'Female', 'No', 'No', 'African American', '203']
['8', '8', '71.408', '7114', '512', '2', '87', '9', ' Male', 'No', 'No', 'Asian', '872']
['9', '9', '15.125', '3300', '266', '5', '66', '13', 'Female', 'No', 'No', 'Caucasian', '279']
['10', '10', '71.061', '6819

It is **very lightweight**, but you have to handle **everything manually** (types, header, encoding...):
- type handling: the `csv` module does not convert types automatically, so everything it reads is a string.
- header: the `csv` module does not know by itself whether the first line is a header or data.
- encoding issues: if the file is not in the encoding used by `open()` (e.g. it is `latin-1` or `windows-1252` but is read as UTF-8), reading it raises `UnicodeDecodeError` or shows garbled characters.
- different separators (`,` or `;` or `\t`): CSV files are not always comma-separated; in Italy they often use `;` or tabs.
- quotes and special characters: if a field contains a comma or a newline, parsing can break if you don’t use the right parameters.
- missing data (empty cells): there is no concept of `NaN`; an empty cell is just an empty string.
- with millions of rows, `csv.reader` streams the file row by row using very little memory, but then you cannot easily filter, join, or operate on the data.

**In short**:<br>
Using plain `csv` is like reading the file “by hand”: we have full control but all the work is on us. *pandas* (or *Polars*) instead handle the header, type inference and missing values automatically, and offer parameters for separators and encoding.
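
To give an idea of what “all the work is on us” means, here is a minimal sketch that reads a small semicolon-separated sample with `csv.DictReader` and converts types and empty cells by hand. The sample data, the column names and the helper `to_float_or_none` are invented for this illustration.

```python
import csv
import io

# Hypothetical sample: ';' separator, a quoted field containing ';', one empty cell
raw = 'name;income;cards\n"Rossi; Mario";14.891;2\nBianchi;;3\n'

def to_float_or_none(text):
    """Convert a CSV string to float; empty cells become None (csv has no NaN)."""
    return float(text) if text.strip() else None

rows = []
# For a real file use: open(path, newline="", encoding="utf-8") instead of io.StringIO
with io.StringIO(raw, newline="") as f:
    reader = csv.DictReader(f, delimiter=";")   # the first line is used as the header
    for record in reader:
        rows.append({
            "name": record["name"],                        # stays a string
            "income": to_float_or_none(record["income"]),  # manual type conversion
            "cards": int(record["cards"]),
        })

print(rows)
```

Every conversion rule has to be written explicitly; with more columns this quickly becomes the kind of work that *pandas* does for us.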


A **second option** is the **`openpyxl`** module (for Excel **.xlsx** files).

In [2]:
from openpyxl import load_workbook

wb = load_workbook("Credit_ISLR.xlsx")
ws = wb.active

for i, row in enumerate(ws.iter_rows(values_only=True)):
    print(row)
    if i >= 19:    # index starts from 0 → 0–19 = 20 rows
        break


('Column1', 'ID', 'Income', 'Limit', 'Rating', 'Cards', 'Age', 'Education', 'Gender', 'Student', 'Married', 'Ethnicity', 'Balance')
(1, 1, 14891, 3606, 283, 2, 34, 11, ' Male', 'No', 'Yes', 'Caucasian', 333)
(2, 2, 106025, 6645, 483, 3, 82, 15, 'Female', 'Yes', 'Yes', 'Asian', 903)
(3, 3, 104593, 7075, 514, 4, 71, 11, ' Male', 'No', 'No', 'Asian', 580)
(4, 4, 148924, 9504, 681, 3, 36, 11, 'Female', 'No', 'No', 'Asian', 964)
(5, 5, 55882, 4897, 357, 2, 68, 16, ' Male', 'No', 'Yes', 'Caucasian', 331)
(6, 6, 8018, 8047, 569, 4, 77, 10, ' Male', 'No', 'No', 'Caucasian', 1151)
(7, 7, 20996, 3388, 259, 2, 37, 12, 'Female', 'No', 'No', 'African American', 203)
(8, 8, 71408, 7114, 512, 2, 87, 9, ' Male', 'No', 'No', 'Asian', 872)
(9, 9, 15125, 3300, 266, 5, 66, 13, 'Female', 'No', 'No', 'Caucasian', 279)
(10, 10, 71061, 6819, 491, 3, 41, 19, 'Female', 'Yes', 'Yes', 'African American', 1350)
(11, 11, 63095, 8117, 589, 4, 30, 14, ' Male', 'No', 'Yes', 'Caucasian', 1407)
(12, 12, 15045, 1311, 138

A **third option** is the pair of modules `xlrd` (reading) and `xlwt` (writing), still for Excel but for the legacy `.xls` format.<br>
⚠️ Recent versions of `xlrd` (2.0 and later) no longer read `.xlsx` files, so today these modules are recommended only for old `.xls` files.

So, in a nutshell:<br>
![](sintesi_formati_tabellari_en.png)

# Loading into pandas

If you only want to read/write (tabular) files without installing heavy external libraries, you can use `csv` or `openpyxl`, as shown before.

Otherwise, you use the [**dataframes**](https://en.wikipedia.org/wiki/Pandas_(software)#DataFrames), which are the **most used data structure in Data Science**:
- they live in memory
- they can be loaded from disk with the `pd.read_*` functions or saved to disk with the `df.to_*` methods, which between them cover the following formats:<br>
*clipboard, csv, excel, html, json, parquet, pickle, sas, sql, spss, stata, xml*.

To load a *csv* or *xlsx* tabular file into a dataframe you use these two *pandas* functions:
```python
import pandas as pd
df = pd.read_csv('data.csv')
df = pd.read_excel('data.xlsx')
```



# Excel or csv?
What is the best format to import tabular files? Excel or csv?<br>
It depends on the goal and on the context, but **in most cases CSV is more efficient, transparent and robust, whereas Excel is more convenient for the human user**.

---

Let’s compare **CSV vs Excel** from the <u>technical</u> point of view and from the <u>human</u> point of view.

---
**"External libraries"**: what does it mean?<br>
The function `pd.read_excel()` is part of pandas, but it does not do *everything* by itself: to parse Excel files it relies on **external libraries** (the so-called *engines*, e.g. `openpyxl` for `.xlsx`) that actually handle the Excel format.<br>
That is, how does `pd.read_excel()` really work?<br>
When you call:
```python
import pandas as pd
df = pd.read_excel("data.xlsx")
```
pandas:
- recognises the file format (e.g. `.xls`, `.xlsx`, `.xlsb`)
- uses an external “engine” to read the data
- turns what it reads into a `DataFrame`
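
The dependency on an external engine can be checked before calling `pd.read_excel()`. The sketch below is only an illustration: the mapping reflects the engines that pandas documents for each extension (`openpyxl` for `.xlsx`, `xlrd` for `.xls`, `pyxlsb` for `.xlsb`), while the helper name `excel_engine_status` is invented here.

```python
import importlib.util
from pathlib import Path

# Engines that pandas documents for each Excel extension
ENGINE_FOR_EXTENSION = {".xlsx": "openpyxl", ".xls": "xlrd", ".xlsb": "pyxlsb"}

def excel_engine_status(filename):
    """Return (engine_name, is_installed) for an Excel file name (hypothetical helper)."""
    ext = Path(filename).suffix.lower()
    engine = ENGINE_FOR_EXTENSION.get(ext)
    if engine is None:
        raise ValueError(f"Not a known Excel extension: {ext!r}")
    installed = importlib.util.find_spec(engine) is not None
    return engine, installed

engine, installed = excel_engine_status("data.xlsx")
print(f"data.xlsx needs {engine!r}; installed: {installed}")
# If the engine is installed, it can also be passed explicitly:
# df = pd.read_excel("data.xlsx", engine="openpyxl")
```

If the engine is missing, `pd.read_excel()` fails with an import error asking you to install it, which is exactly the "external library" dependency described above.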

---

**3. Performance: indicative comparison**:<br>
![](performance_csv_excel_en.png)

**4. In practice**:<br>
👉 If the file **comes from a system or an application** (ERP, CRM, management, accounting, …) it is almost always better to receive it as **CSV**.<br>
👉 If the file **is made by a person** and must be read by a person, Excel is nicer, but as soon as you need to automate the processing, CSV (or another machine-friendly format) becomes preferable.<br>

👉 In any case, once the Excel file is loaded into a dataframe you can always convert it to csv with `DataFrame.to_csv()`, or use `DataFrame.to_parquet()` for internal / efficient storage.


# The importance and spread of the *csv* format
Let’s see in more detail **why the CSV format is so ubiquitous in the data world**: it is hard to imagine a tool for managing tabular data (Excel, Google Sheets, LibreOffice Calc, most reporting, BI, data-science and relational DB tools, etc.) that does not allow csv import/export.
![](importanza_csv_en.png)

# Technical reasons for the diffusion of csv
There are several **very concrete technical reasons** that explain why the CSV format is so widespread in the data world.<br>
They are summarised in the following table:

![](diffusione_csv_en.png)

💡 In short:<br>
CSV is the **“lowest common denominator” of tabular data**: simple, textual, line-based, without dependencies and compatible with everything, from Excel to Spark.<br>
It is not perfect (no types, schema, compression or metadata), but **its structural poverty is exactly its strength**.

# Reading CSV files in pandas

As said, in Data Science we often are NOT interested in the online DB but in **a local file**, which can be of various formats (csv, json, parquet, etc.).

## 3 technical notes on the CSV format

* the two main arguments of the pandas `read_csv` function are the **column separator** (`sep`, default `,`) and the **presence** (and possible position) of the header row (`header`).
* Excel can save several different csv formats; you must choose the correct one (see the YouTube video by *Excel Tutorials by EasyClick Academy*).
  
  ![](tipi_csv_en.png)
* [pros and cons](https://towardsdatascience.com/why-i-stopped-dumping-dataframes-to-a-csv-and-why-you-should-too-c0954c410f8f) of the csv format

A csv file is textual and therefore can also be read in Notepad.



Let’s load into *pandas* the well-known banking file `Credit_ISLR`:

In [3]:
import pandas as pd
df_credit = pd.read_csv("Credit_ISLR.csv", header=0)
df_credit

Unnamed: 0.1,Unnamed: 0,ID,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity,Balance
0,1,1,14.891,3606,283,2,34,11,Male,No,Yes,Caucasian,333
1,2,2,106.025,6645,483,3,82,15,Female,Yes,Yes,Asian,903
2,3,3,104.593,7075,514,4,71,11,Male,No,No,Asian,580
3,4,4,148.924,9504,681,3,36,11,Female,No,No,Asian,964
4,5,5,55.882,4897,357,2,68,16,Male,No,Yes,Caucasian,331
...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,396,396,12.096,4100,307,3,32,13,Male,No,Yes,Caucasian,560
396,397,397,13.364,3838,296,5,65,17,Male,No,No,African American,480
397,398,398,57.872,4171,321,5,67,12,Female,No,Yes,Caucasian,138
398,399,399,37.728,2525,192,1,44,13,Male,No,Yes,Caucasian,0


As you can see, the *pandas* `read_csv` function automatically created a row index (0, 1, 2, …), so the original `ID` column is now <u>redundant</u>. In addition, the first column of the file has an empty header, so pandas named it `Unnamed: 0` (we will see later why). It is good practice to drop both because they are useless.

In [4]:
df_credit.drop(columns=['Unnamed: 0', 'ID'], inplace=True)

In [5]:
df_credit.head()

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity,Balance
0,14.891,3606,283,2,34,11,Male,No,Yes,Caucasian,333
1,106.025,6645,483,3,82,15,Female,Yes,Yes,Asian,903
2,104.593,7075,514,4,71,11,Male,No,No,Asian,580
3,148.924,9504,681,3,36,11,Female,No,No,Asian,964
4,55.882,4897,357,2,68,16,Male,No,Yes,Caucasian,331


## Is pandas ‚Äúheavy‚Äù?

⚙️ What does it mean that “pandas is heavy”? It is, in several ways:

1️⃣ **Size and complexity**

* Pandas is not a tiny library:<br>

  * installing it brings in required dependencies such as `numpy`, `python-dateutil`, `pytz`, `tzdata` (the exact list depends on the pandas version);
  * many features need further optional packages: `matplotlib`, `openpyxl`, `xlrd`, etc.
* The installed package takes **tens of MB**.
* Loading it into memory at startup is **slower** than a simple `import csv`.

👉 So for a script that just has to read a CSV file and print 10 rows, importing the whole of pandas is like using a truck to deliver a letter.
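
The startup cost can be measured rather than guessed. The following is a rough sketch, not a benchmark: the helper `time_import` is invented here, and the numbers are meaningful only in a fresh interpreter, because a module that has already been imported is returned from the cache almost instantly.

```python
import importlib
import importlib.util
import time

def time_import(module_name):
    """Return the seconds needed to import a module (hypothetical helper)."""
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start

print(f"csv:    {time_import('csv'):.4f} s")
if importlib.util.find_spec("pandas") is not None:
    print(f"pandas: {time_import('pandas'):.4f} s")
else:
    print("pandas is not installed in this environment")
```

For a more detailed breakdown, the interpreter option `python -X importtime script.py` prints the import time of every module.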

2️⃣ **External dependencies**<br>

To work well with many formats, pandas uses external libraries (as mentioned above):

* `openpyxl` for *.xlsx* files
* `xlrd` for *.xls* files
* `pyarrow` for *.parquet* files
* `numexpr` to speed up some numeric operations
* `matplotlib` for `.plot()`<br>

👉 These dependencies are convenient in a data-science environment,
but excessive in a system script or a microservice.

3️⃣ **Impact on small environments**<br>

In contexts such as:

* Docker microservices
* lightweight CLI scripts
* serverless functions (AWS Lambda, GCP Functions)
* systems with memory constraints

importing pandas can:

* slow down the script startup,
* increase the Docker image by tens or even hundreds of MB,
* lead to incompatibilities or long cold-start times.

**That’s why sometimes you want to “avoid pandas”:**

* You don’t need its advanced analytics features.
* You just want to read a file and iterate over its rows.
* You want to reduce dependencies and startup time.

In that case it makes more sense to use:

```python
import csv      # for CSV files
import openpyxl # for modern Excel files
```

which are much lighter modules.

🧠 **Metaphor**<br>
Pandas is like Excel or a full ERP system.<br>
If you just need to open a text file and print two columns, Notepad is enough.


## Converting Excel files to CSV format

Conversely, to view a CSV file **in the standard Excel format** you can do this (there are other ways too):
* open a **new file**
* `Data` tab
* button at the top left `Get Data`
* `From file` --> `From text/CSV`
* in the preview make any changes you need (often the defaults are enough) and then press the "Load" button at the bottom right.

This process is well described in the video [How to Convert CSV to Excel (Simple and Quick)](https://www.youtube.com/watch?v=jw1DSuqr3ew) from *Excel Tutorials by EasyClick Academy*. Subtitles are available. 

If a file is created in Excel and then converted to CSV, saved, and later reopened in Excel, Excel shows it “as Excel” (i.e. with the grid). Why?<br>

Excel doesn’t open the CSV as a real Excel file: it **parses it as a text table** and displays it in the same UI (cells, rows, columns).<br>
That makes it *look* like Excel, but the file actually doesn’t contain any of the typical `.xlsx` features:

* no formatting,
* no formulas,
* no multiple sheets,
* no complex data types.

Excel is simply showing **a grid on top of a text file**.<br>
It’s a bit like opening a `.txt` file in Word: the content is just text, but the environment is Word, with its “rich” look.

💡 One-sentence summary:<br>

> Excel detects the `.csv` extension, interprets the data as tabular, and shows it in its interface, but the file remains plain structured text, not a real Excel workbook.

**Objection**: if I open in Excel a CSV that was generated independently of Excel, the grid is not applied.<br>
True: Excel doesn’t parse CSV “smartly” based on the content; it uses the system’s **regional settings** (the Windows “locale” or “list separator”).<br>
For example:

* in many European countries (e.g. Italy, Germany), the default list separator is `;` (semicolon);
* in the US/UK, it‚Äôs `,` (comma).

So:

* if the CSV comes from Excel, it uses the same separator as your regional settings → Excel “recognizes” it and shows the table.
* if the CSV comes from another program (e.g. pandas `to_csv()`, MySQL, or international systems) that uses the comma, Excel doesn’t know that’s the separator and puts each row in a single cell.

The user can fix it manually:

```text
Data → From Text/CSV → choose the correct delimiter (comma, semicolon, tab)
```

or change the system’s regional setting.

💡 In short:

> Excel “treats as a table” only those CSV files that follow its regional conventions (separator and encoding).<br>
> If the file is generated elsewhere with other standards, Excel shows it as text in one column.

---
## Summary table: ways to open a CSV with the correct grid

| # | Method           | Path in Excel | Advantage | When to use it |
|---|------------------|---------------|-----------|----------------|
| 1 | **Guided import (recommended)** | **Data → From Text/CSV →** select the file → choose **Delimiter** (comma, semicolon, tab) → **Load** | Detects separator, encoding and shows a preview | Always, if the CSV was **not** generated by Excel |
| 2 | **Set system default separator (Windows)** | **Control Panel → Region → Additional settings → “List separator” →** set `,` or `;` | Excel opens CSV correctly with double-click | Useful if you often open CSVs with the **same** delimiter |
| 3 | **Rename file and change extension (trick)** | Rename `.csv` to `.txt`, then **Data → From Text →** follow the wizard | Forces you to choose the delimiter | Useful if Excel keeps putting everything into **one column** |
| 4 | **Open Excel first, then “Open → File → CSV”** | **File → Open → Browse →** file type: **All files** → pick the CSV → the import window appears | Lets you choose encoding and separator | Alternative to importing from **Data** |
| 5 | **Change Excel language/locale (optional)** | **File → Options → Advanced → List separator** | Aligns Excel to the file format (e.g. US = `,`, Italy = `;`) | For mixed international / cloud usage |

---
💬 **Short explanation**

When Excel shows everything in one column, it’s because:

* the file delimiter (e.g. `,`)
  ≠
* the system list separator (e.g. `;` in Italy).

👉 Fix: use the import wizard, which lets you choose the delimiter and encoding (UTF-8 recommended).<br>
After that choice, Excel immediately shows the correct grid and you can save as `.xlsx` if you want to keep it stable.

💡 Practical tip for those who often work with Python or external CSVs:

* use

  ```python
  df.to_csv("file.csv", sep=";", encoding="utf-8-sig")
  ```

  → this way Excel (Italian version) opens it already “as a grid” with no manual steps.
* `utf-8-sig` is the variant of UTF-8 that writes a byte-order mark (BOM) at the start of the file; Excel uses it to recognise the CSV as UTF-8.
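
What `utf-8-sig` actually writes can be verified by reading the raw bytes back. This is a small sketch with an invented two-row dataframe and a temporary file; it only shows the BOM and the `;` separator, not Excel's behaviour.

```python
import codecs
import os
import tempfile

import pandas as pd

# Hypothetical data with an accented column name
df = pd.DataFrame({"città": ["Torino", "Málaga"], "importo": [10.5, 20.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.csv")
    df.to_csv(path, sep=";", index=False, encoding="utf-8-sig")
    with open(path, "rb") as f:          # read raw bytes, no decoding
        raw = f.read()

has_bom = raw.startswith(codecs.BOM_UTF8)                       # b'\xef\xbb\xbf'
first_line = raw[len(codecs.BOM_UTF8):].decode("utf-8").splitlines()[0]
print(has_bom, first_line)
```

An Italian Excel also expects a decimal comma; `to_csv` accepts a `decimal=","` argument for that case.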


# Input parameters of `pd.read_csv` function

We have already mentioned the two **fundamental input arguments** of the pandas `read_csv` function: `sep` and `header`. They are **critical**:

- `header=1` tells pandas to skip the first row of the file and use the second row as the header.<br>
  If our CSVs do **not** have two header rows, or if the first file has a slightly different format from the others (spaces, separator, BOM, etc.), pandas will **misinterpret the columns**.
- `sep=';'` can break all the data (if the file is actually a real CSV with commas!).<br>
  Be careful: many “CSV files” actually use `;` as the separator!
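
The effect of a wrong `sep` or `header` is easy to reproduce on a tiny in-memory file. The sample text below is invented for the illustration.

```python
import io

import pandas as pd

raw = "name;income\nAnna;10.5\nLuca;20.0\n"   # hypothetical ';'-separated file

ok = pd.read_csv(io.StringIO(raw), sep=";")                       # correct separator
wrong_sep = pd.read_csv(io.StringIO(raw))                         # default ',': one single column
wrong_header = pd.read_csv(io.StringIO(raw), sep=";", header=1)   # first data row used as header

print(list(ok.columns), ok.shape)
print(list(wrong_sep.columns), wrong_sep.shape)
print(list(wrong_header.columns), wrong_header.shape)
```

None of the three calls raises an error: the wrong settings silently produce a dataframe with the wrong shape, which is why these two arguments deserve a check right after loading.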

In reality, the `read_csv` function has **many other input arguments**, as you can see from the help in the next cell; we will later go deeper into the **main** ones.


In [6]:
help(pd.read_csv)

Help on function read_csv in module pandas.io.parsers.readers:

read_csv(
    filepath_or_buffer: 'FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]',
    *,
    sep: 'str | None | lib.NoDefault' = <no_default>,
    delimiter: 'str | None | lib.NoDefault' = None,
    header: "int | Sequence[int] | None | Literal['infer']" = 'infer',
    names: 'Sequence[Hashable] | None | lib.NoDefault' = <no_default>,
    index_col: 'IndexLabel | Literal[False] | None' = None,
    usecols: 'UsecolsArgType' = None,
    dtype: 'DtypeArg | None' = None,
    engine: 'CSVEngine | None' = None,
    converters: 'Mapping[Hashable, Callable] | None' = None,
    true_values: 'list | None' = None,
    false_values: 'list | None' = None,
    skipinitialspace: 'bool' = False,
    skiprows: 'list[int] | int | Callable[[Hashable], bool] | None' = None,
    skipfooter: 'int' = 0,
    nrows: 'int | None' = None,
    na_values: 'Hashable | Iterable[Hashable] | Mapping[Hashable, Iterable[Hashable]] | None' = None,
  

> Reading a *csv* file is often one of the **first real difficulties** that those who start using pandas run into.
> The reason is that csv reading is simple in theory, but in practice the data is often dirty, mixed, encoded differently, or irregular.
> This makes the initial approach a bit **tricky** and a source of **considerable frustration**, for two reasons:
> - the many and non-trivial arguments of the `read_csv` function
> - the **irregularities** in the data in the files

## The mapping

The function `pd.read_csv` does an **automatic mapping** of CSV data into pandas, as described here:

![](how_pandas_infers_CSV_datatypes.png)

There is a problem that is not mentioned in the slide: the `read_csv` function **does not infer categorical variables** on its own (when they are present in the CSV file as strings), so they are imported as `object`, the generic pandas data type used for strings. As you can see:


In [7]:
df_credit.dtypes

Income       float64
Limit          int64
Rating         int64
Cards          int64
Age            int64
Education      int64
Gender        object
Student       object
Married       object
Ethnicity     object
Balance        int64
dtype: object

The variables `Gender`, `Student`, `Married` and `Ethnicity` are **categorical** in nature: that is, they can only take on a small, fixed number of values.
Other variables such as `Education`, if imported as text, could potentially take on an infinite number of values.

Each cell of the variable, if imported as `object`, **points to a string in memory, often duplicated several times**.

It is therefore necessary to **convert** these variables to the `category` format available from *pandas* (it does not exist in base Python), as follows:

In [8]:
df_credit['Gender'] = df_credit['Gender'].astype('category')
df_credit.dtypes

Income        float64
Limit           int64
Rating          int64
Cards           int64
Age             int64
Education       int64
Gender       category
Student        object
Married        object
Ethnicity      object
Balance         int64
dtype: object

The *pandas* method `astype('category')`:

* creates an **internal encoding table** (the “levels” or “categories”),
* represents the column as **internal integers** (0, 1, 2, …) instead of repeated strings.

The behavior of `category` is similar to the **factor** in R.
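
The encoding table and the integer codes can be inspected through the `.cat` accessor. The sketch below uses a small invented series, and also shows that the conversion can be requested directly at load time with the `dtype` argument of `read_csv`.

```python
import io

import pandas as pd

gender = pd.Series(["Male", "Female", "Male", "Female", "Female"]).astype("category")

print(list(gender.cat.categories))   # the encoding table (the "levels")
print(list(gender.cat.codes))        # the integer code stored for each row

# Same result at load time, on a hypothetical two-row CSV
raw = "Gender,Balance\nMale,333\nFemale,903\n"
df = pd.read_csv(io.StringIO(raw), dtype={"Gender": "category"})
print(df.dtypes)
```

Passing `dtype` at load time avoids creating the intermediate `object` column, which matters for large files.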

🚀 <font size="4">**Main advantages** (of `category`)</font><br>

🔹 **Memory efficiency**<br>
Each value becomes **an integer**, and the string is **stored only once in the category table**.<br>
👉 On large datasets with few distinct values, the saving can reach 70–90% of RAM.
Example:

```python
df['città'].memory_usage(deep=True)
df['città'].astype('category').memory_usage(deep=True)
```

The second one takes much less space.

🔹 **Processing speed**<br>
Many pandas operations (`groupby`, `sort_values`, `value_counts`, `merge`) can become **much faster**; in fact:

* comparing integers is faster than comparing strings,
* grouping and join algorithms work on numeric codes.

💡 Typical case: `df.groupby('categoria').agg(...)` is much faster if `categoria` is `category`.

🔹 **Semantic meaning**<br>
A categorical variable has **a finite and known number of levels**.<br>
This is useful to:

* ensure that “out-of-list” values don’t appear (e.g. ‘Femmina’ vs ‘F’),
* keep the logical or hierarchical order (e.g. Low < Medium < High).<br>

You can also explicitly define the order like this:

```python
df['livello'] = pd.Categorical(df['livello'], categories=['basso','medio','alto'], ordered=True)
```

→ useful for comparisons, sorting, or encoding in machine learning.
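
Both properties (control of out-of-list values and logical order) can be seen on a small invented series. Here `'altissimo'` is deliberately not among the declared levels.

```python
import pandas as pd

levels = ["basso", "medio", "alto"]
s = pd.Series(pd.Categorical(["medio", "alto", "basso", "altissimo"],
                             categories=levels, ordered=True))

print(s.isna().tolist())          # 'altissimo' is not a declared level, so it becomes NaN
print((s < "alto").tolist())      # comparisons follow the declared order
print(s.sort_values().tolist())   # sorted as basso < medio < alto, NaN last
```

Out-of-list values are not rejected with an error at construction time: they become missing values, so a quick `isna()` check after the conversion is a cheap way to spot typos in the data.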

🔹 **ML and preprocessing compatibility**<br>
Some machine-learning libraries (e.g. LightGBM, or XGBoost with `enable_categorical=True`) recognise the `category` dtype and treat the column directly as a discrete variable. Encoders such as `sklearn.preprocessing.OrdinalEncoder` and `OneHotEncoder` still have to be applied explicitly, and they work on `object` and `category` columns alike.

---

⚠️ **When is `category` not convenient?**

* if the column has **many unique values** (e.g. a unique code or a customer ID), the conversion brings no benefit: the category table would be as large as the column itself.
* if values are frequently changed (adding new categories), the `category` type is less flexible.

🔍 **Practical example**:

```python
import pandas as pd
df = pd.DataFrame({
    'sesso': ['M','F','M','F','F']*100000
})
print(df['sesso'].memory_usage(deep=True))   # object
df['sesso'] = df['sesso'].astype('category')
print(df['sesso'].memory_usage(deep=True))   # category (much less!)
```

---

> 💬 In brief

| Aspect           | `object`             | `category`                      |
|------------------|----------------------|----------------------------------|
| Base type        | Python strings       | integer codes + category list   |
| Memory           | High                 | Very low                        |
| Speed            | Slower               | Faster                          |
| Semantics        | Free text            | Finite discrete values          |
| Machine Learning | Must be encoded first| Usable directly by libraries that support it |



## The arguments of the `pd.read_csv` function

[Here](https://github.com/nikitaprasad21/ML-Cheat-Codes/blob/main/Data-Gathering/CSV-(Comma-Separated-Values)-Files/csv_file_cheatcodes.ipynb) is an excellent notebook that **illustrates the various arguments** of `pd.read_csv`; a copy has been **downloaded** into the directory of this notebook.

## The `Unnamed: 0` column
See [this ChatGPT chat](https://chatgpt.com/share/68f74bca-554c-8012-a844-7260ce18391d) (ask for a translation if needed).


# Frequent problems when loading CSV files in pandas

Here is a list of the **most common problems** you encounter when loading a csv with `pandas.read_csv()`, together with **typical causes** and **solutions**:

🧩 **1. Columns “Unnamed: 0” or “Unnamed: n”** (already seen before)

<u>Problem</u>: an unwanted column called `Unnamed: 0` appears.<br>
<u>Cause</u>: often the CSV includes an index saved from a previous `DataFrame.to_csv()` (i.e. `index=True` by default).<br>
<u>Solution</u>:
```python
pd.read_csv("file.csv", index_col=0)
# or
pd.read_csv("file.csv").drop(columns=["Unnamed: 0"])
```
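
The cause can be reproduced in memory with a round trip through `to_csv()`. The two-row dataframe is invented for the example.

```python
import io

import pandas as pd

df = pd.DataFrame({"Income": [14.891, 106.025], "Cards": [2, 3]})

csv_with_index = df.to_csv()                 # index=True by default: the header starts with an empty name
csv_without_index = df.to_csv(index=False)   # the index is not written at all

cols_default = list(pd.read_csv(io.StringIO(csv_with_index)).columns)
cols_index_col = list(pd.read_csv(io.StringIO(csv_with_index), index_col=0).columns)
cols_no_index = list(pd.read_csv(io.StringIO(csv_without_index)).columns)

print(cols_default)     # contains 'Unnamed: 0'
print(cols_index_col)   # the first column is used as the index
print(cols_no_index)    # nothing to fix
```

When you control the export, writing with `index=False` is usually the simplest way to prevent the problem at the source.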

⚙️ **2. Wrong delimiters** (already seen before)

<u>Problem</u>: the file is not split correctly (all columns end up in a single one).<br>
<u>Cause</u>: the separator is not a comma, but a semicolon `;`, a tab `\t`, or something else.<br>
<u>Solution</u>:

```python
pd.read_csv("file.csv", sep=";")      # for European-style CSV
pd.read_csv("file.csv", sep="\t")     # for TSV files
```

---

A TSV (*Tab-Separated Values*) is basically a CSV, but instead of `,` or `;` it uses the tab character `\t` as the field separator.
You can also let pandas detect the separator automatically:

```python
pd.read_csv("file.csv", sep=None, engine="python")
```

🧩 When do you use a TSV?

* when the data contains lots of commas or semicolons (e.g. text descriptions).
* when the file is exported by Unix systems or databases (e.g. PostgreSQL `COPY TO`, Excel → “Text (tab delimited)”).
* when you want to avoid ambiguity between decimal separators and field separators.

---

The `engine` parameter in `pandas.read_csv()` is used to tell pandas which **parsing engine** to use to read and interpret the CSV file.<br>
In practice, pandas has **two main parser engines** that do the same job (read the file and turn it into a DataFrame), but **with different features and performance** (recent versions also offer a third one, `engine="pyarrow"`).

1️⃣ **`engine="c"`** → the “fast” parser (default)

* written in C → very fast
* it is the default in almost all cases
* great for clean, regular files
* but… it is less flexible: it doesn’t support every option and may fail on “messy” or complex CSVs

Usage example:

```python
pd.read_csv("file.csv", engine="c")
```

2️⃣ **`engine="python"`** → the “robust” parser

* written in pure Python → slower, but more tolerant
* supports options that the C parser doesn’t handle well, such as:

  * `sep=None` (i.e. **automatic separator detection**),
  * multiple or irregular delimiters,
  * malformed lines (`on_bad_lines`),
  * complex quotes and special characters.

Usage example:

```python
pd.read_csv("file.csv", sep=None, engine="python")
```

👉 Here pandas tries to guess the separator automatically (`,`, `;`, `\t`, etc.) by looking at the first rows.
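
According to the pandas documentation, this automatic detection relies on Python's built-in `csv.Sniffer`. The sketch below calls the sniffer directly on an invented sample, restricting the candidates to the three usual separators; it illustrates the idea and is not a copy of what pandas does internally.

```python
import csv

# Hypothetical first rows of a European-style file
sample = "name;income;cards\nAnna;10.5;2\nLuca;20.0;3\n"

dialect = csv.Sniffer().sniff(sample, delimiters=",;\t")
print(repr(dialect.delimiter))
```

The sniffer is a heuristic: on short or very irregular samples it can fail or guess wrong, so passing an explicit `sep` remains the safer choice when the separator is known.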

---



Let‚Äôs go back to the list of problems and solutions of `read_csv`.

üî§ **3. Wrong encoding**

<u>Problem</u>: accented characters or special symbols appear as ÔøΩ or raise `UnicodeDecodeError`.<br>
<u>Cause</u>: the file is not in `UTF‚Äë8` but in `latin1`, `cp1252`, etc.<br>
<u>Solution</u>:
```python
pd.read_csv("file.csv", encoding="latin1")
```

> What is `latin1`?
> - `latin1` (or `ISO-8859-1`) is a 1-byte (8-bit) encoding **widely used in Western Europe** before UTF-8 became standard.
> - It supports many Western European characters:
>   - Italian accented letters: `à`, `è`, `é`, `ì`, `ò`, `ù`, ...
>   - Spanish/French/Portuguese letters: `ñ`, `ç`, `á`, `é`, `õ`, ...
>   - German umlaut vowels: `ä`, `ö`, `ü`, ...
>   - Nordic `ø`, `å`
>
> **BUT** latin1 does NOT support:
> - emoji
> - euro symbol `€`
> - Greek, Cyrillic, Arabic, Chinese, etc.
>
> **So `latin1` is basically “Western Europe in the 90s”.** Writing characters outside that set raises `UnicodeEncodeError`. Reading a file that is *not* `latin1` with `encoding="latin1"` never raises an error (every byte is a valid `latin1` character) but silently produces garbled text, whereas reading a `latin1` file as UTF-8 usually raises `UnicodeDecodeError`.

<img src="ascii_latin1_utf_8.png" alt="image" width="600">
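
The three situations (garbled text, decoding error, encoding error) can be reproduced with plain Python strings, without any file. This is a minimal sketch; the word `già` is just an example containing one accented letter.

```python
text = "già"

utf8_bytes = text.encode("utf-8")      # 'à' takes 2 bytes in UTF-8
latin1_bytes = text.encode("latin1")   # 'à' takes 1 byte in latin1

# 1) UTF-8 bytes read as latin1: no error, but garbled text
garbled = utf8_bytes.decode("latin1")
print(repr(garbled))

# 2) latin1 bytes read as UTF-8: decoding error
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# 3) '€' cannot be written in latin1, but it exists in cp1252
try:
    "€".encode("latin1")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True
euro_cp1252 = "€".encode("cp1252")

print(decode_failed, encode_failed, euro_cp1252)
```

Case 1 is the most insidious, because nothing fails: the text simply comes out wrong, with each accented letter replaced by two unrelated characters.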

Test of **some errors**, in various steps:<br>
1. we create a `DataFrame` with typical `latin1` characters

In [10]:
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Nome": ["André", "José", "Ana María"],
    "Città": ["Torino", "Málaga", "Zürich"],
    "Note": [
        "pagato 50$ già fatturato",              # 'latin1' extends ASCII, which already contains '$', so 'latin1' accepts it too
        "año siguiente -> revisión técnica",
        "più vecchio -> già sostituito"
    ]
})


2. we save the `DataFrame` to a `latin1` (ISO-8859-1) CSV

In [11]:
file_name = "clienti_latin1.csv"
df.to_csv(
    file_name,
    index=False,
    sep=";",             # we also set the ';' separator so it is even more realistic "European style"
    encoding="latin1"    # <-- key point
)

print(f"Created file {file_name} with latin1 encoding")

Created file clienti_latin1.csv with latin1 encoding


3. we try to read it again WITHOUT specifying the encoding.<br>
This is what a distracted user usually does:

In [12]:
try:
    df_fail = pd.read_csv(file_name, sep=";")  # no encoding passed, i.e. default = UTF-8
    print("Read without errors?! Here are the first rows:")
    print(df_fail.head())
except UnicodeDecodeError as e:
    print("⚠️ Expected decoding error when reading without an explicit encoding:")
    print(e)

⚠️ Expected decoding error when reading without an explicit encoding:
'utf-8' codec can't decode byte 0xe0 in position 12: invalid continuation byte


4. it raises an error, because we tried to read a `latin1` file as `utf-8`.<br>
The correct solution is to read it specifying `encoding="latin1"`:

In [13]:
df_ok = pd.read_csv(file_name, sep=";", encoding="latin1")
print("\n💡 Lettura corretta con encoding='latin1':")
print(df_ok.head())



💡 Lettura corretta con encoding='latin1':
   ID       Nome   Città                               Note
0   1      André  Torino           pagato 50$ già fatturato
1   2       José  Málaga  año siguiente -> revisión técnica
2   3  Ana María  Zürich      più vecchio -> già sostituito

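When the encoding of a received file is unknown, one pragmatic approach is to try a short list of candidates in order. The sketch below uses only the standard library; `guess_encoding` and its candidate list are hypothetical names chosen for this example, not a pandas feature. `latin1` has to come last, because it accepts any byte sequence and therefore never fails.

```python
import os
import tempfile

def guess_encoding(path, candidates=("utf-8", "latin1")):
    """Return the first candidate that decodes the whole file (hypothetical helper)."""
    with open(path, "rb") as f:
        raw = f.read()          # fine for small files; a sample is enough for huge ones
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

# Demo on two small temporary files
tmp_dir = tempfile.mkdtemp()
path_latin1 = os.path.join(tmp_dir, "latin1.csv")
path_utf8 = os.path.join(tmp_dir, "utf8.csv")
with open(path_latin1, "w", encoding="latin1") as f:
    f.write("ID;Città\n1;Málaga\n")
with open(path_utf8, "w", encoding="utf-8") as f:
    f.write("ID;Note\n1;50€\n")

print(guess_encoding(path_latin1))
print(guess_encoding(path_utf8))
```

The result could then be passed on, for example `pd.read_csv(path, sep=";", encoding=guess_encoding(path))`. A successful decode does not prove that the guess is right, so a quick look at the accented characters is still advisable.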

5. and what happens with the following dataframe, which contains the `€` character (instead of `$`), which belongs to neither ASCII nor `latin1`?

In [14]:
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Nome": ["André", "José", "Ana María"],
    "Città": ["Torino", "Málaga", "Zürich"],
    "Note": [
        "pagato 50€ già fatturato",
        "año siguiente -> revisión técnica",
        "più vecchio -> già sostituito"
    ]
})

This code goes wrong:

In [15]:
file_name = "clienti_latin1.csv"
df.to_csv(
    file_name,
    index=False,
    sep=";",             # we also set the ';' separator so it is even more realistic "European style"
    encoding="latin1"    # <-- key point
)

print(f"Creato file {file_name} in encoding latin1")

UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 24: ordinal not in range(256)

6. it raises an error, because `latin1` does not contain the `€` character.<br>
We must write in `utf-8`:

In [16]:
file_name = "clienti_utf8.csv"
df.to_csv(
    file_name,
    index=False,
    sep=";",             # we also set the ';' separator so it is even more realistic "European style"
    encoding="utf-8"     # <-- key point
)

print(f"Creato file {file_name} in encoding utf-8")

Creato file clienti_utf8.csv in encoding utf-8


📉 **4. Wrong data type**

<u>Problem</u>: numeric columns imported as strings (`object`).<br>
<u>Cause</u>: presence of thousand separators, symbols, or empty cells.<br>
<u>Solution</u>:
```python
    pd.read_csv("file.csv", thousands=".", decimal=",")
```
or after the read:
```python
    df["col"] = pd.to_numeric(df["col"], errors="coerce")
```
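Note that `pd.to_numeric(..., errors="coerce")` on its own would turn a string like `"1.234,50"` into `NaN`, because it does not know about European separators. A minimal sketch of a conversion after the read, assuming the column holds text with `.` as thousands separator and `,` as decimal separator (the sample values are illustrative):

```python
import pandas as pd

s = pd.Series(["1.234,50", "99,99", "", "2.000,00"])

# Applied directly, every value that contains a comma is coerced to NaN
naive = pd.to_numeric(s, errors="coerce")

# Remove the thousands separator first, then turn the decimal comma into a dot
cleaned = (
    s.str.replace(".", "", regex=False)
     .str.replace(",", ".", regex=False)
)
numbers = pd.to_numeric(cleaned, errors="coerce")

print(naive.tolist())
print(numbers.tolist())
```

The order of the two replacements matters: swapping them would delete the freshly created decimal point.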

In [17]:
import pandas as pd

# ###############################################
# 1. CSV CREATION with EUROPEAN FORMATTING      #
# ###############################################

# Notes:
# - "1.234,50" = one thousand two hundred thirty-four point fifty (1234.50)
# - "2.000"    = two thousand (integer)
# - ""         = empty cell (missing value)

df_orig = pd.DataFrame({
    "Prodotto": ["A123", "B777", "C900", "D010"],
    "PrezzoUnitario": ["1.234,50", "99,99", "", "2.000,00"],
    "Quantità": ["1.000", "250", "", "1.500"]
})

csv_name = "prezzi_legacy.csv"

# We save as CSV with ';' because it is very common in Italian administrative exports
df_orig.to_csv(
    csv_name,
    index=False,
    sep=";",
    encoding="utf-8"
)

print(f"[OK] Creato file CSV '{csv_name}' con separatori migliaia '.' e decimali ','.\n")



[OK] Creato file CSV 'prezzi_legacy.csv' con separatori migliaia '.' e decimali ','.



In [18]:
# ##################################
# 2. WRONG READ (DEFAULT)          #
# ##################################

print("=== WRONG READ (without thousands/decimal) ===")

df_bad = pd.read_csv(
    csv_name,
    sep=";"          # we correctly read the column separator
                     # but we do NOT tell pandas how to interpret numbers (thousands and decimals)
)

print("\nDataFrame read (wrong):")
print(df_bad)

print("\nData types after the wrong read:")
print(df_bad.dtypes)

# Test of a numerical operation: 'PrezzoUnitario' is still text (object),
# while 'Quantità' was parsed as float64 but with WRONG values ("1.000" became 1.0)
print("\nSum of the Quantità column (numeric, but silently wrong):")
try:
    print(df_bad["Quantità"].sum())
except Exception as e:
    print("Error during the sum:", e)

=== WRONG READ (without thousands/decimal) ===

DataFrame read (wrong):
  Prodotto PrezzoUnitario  Quantità
0     A123       1.234,50       1.0
1     B777          99,99     250.0
2     C900            NaN       NaN
3     D010       2.000,00       1.5

Data types after the wrong read:
Prodotto           object
PrezzoUnitario     object
Quantità          float64
dtype: object

Sum of the Quantità column (numeric, but silently wrong):
252.5


**Why is `PrezzoUnitario` `object` in `df_bad.dtypes` (instead of `float`)?**

For two reasons together (plus a remark on empty cells):<br>

**1. The decimal separator is a comma, not a dot**<br>
*Example*: *1.234,50*<br>
for pandas (with no extra instructions), *1.234,50* is not a valid number, **it’s a string**.<br>
pandas actually expects *1234.50* (dot for decimals, no thousands separator).<br>
So it leaves it as text (`object`).

**2. There is a thousands separator `.`**<br>
Look at *2.000,00*:<br>
the ideal for pandas would be *2000.00*<br>
instead it finds *2.000,00*, which looks like “2 dot 000 comma 00”.<br>
For the standard parser this is not a valid float → it stays a string (`object`).

The `Quantità` column is even more dangerous: *1.000* and *1.500* **are** valid floats for the default parser, which reads the dot as a decimal point.<br>
So pandas does not complain: the column becomes `float64`, but “one thousand” is silently read as `1.0` and “one thousand five hundred” as `1.5` (hence the wrong sum `252.5`).

**3. What about the empty cells?**<br>
Empty cells are not the cause: pandas reads them as `NaN` (missing value), which is compatible with a numeric column.<br>
In `PrezzoUnitario` the values are:

* *1.234,50* (text)
* *99,99* (text)
* empty cell (`NaN`)
* *2.000,00* (text)

Since the non-empty values cannot be parsed as numbers → pandas says: “ok, everything is `object` (strings) and let’s move on”.

If all the values were clear, English-style numbers (1234.50, 99.99, 2000.00, etc.) then pandas would have inferred `float64` by itself, empty cells included.

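Because a column like `Quantità` can be misread without any error, a defensive option is to load every column as text and convert explicitly afterwards. A minimal sketch, assuming the same European formatting as above (the inline CSV is a shortened copy of the example file):

```python
import io
import pandas as pd

csv_text = "Prodotto;PrezzoUnitario;Quantità\nA123;1.234,50;1.000\nB777;99,99;250\n"

# dtype=str disables numeric inference: "1.000" stays the string "1.000"
df_text = pd.read_csv(io.StringIO(csv_text), sep=";", dtype=str)

# Explicit parsing with the European separators
df_num = pd.read_csv(io.StringIO(csv_text), sep=";", thousands=".", decimal=",")

print(df_text["Quantità"].tolist())
print(df_num["Quantità"].tolist())
```

Reading as text first costs an extra conversion step, but a wrong format then shows up as an obvious string instead of a plausible-looking wrong number.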

In [19]:
# ############################################
# 3. RIGHT READ (input parsing)              #
# ############################################

print("\n\n=== LETTURA CORRETTA (thousands='.', decimal=',') ===")

df_good = pd.read_csv(
    csv_name,
    sep=";",
    thousands=".",  # removes thousands separator
    decimal=","     # interprets comma as decimal separator
)

print("\nDataFrame letto (corretto):")
print(df_good)

print("\nTipi di dato dopo lettura corretta:")
print(df_good.dtypes)

print("\nSomma Quantità (ora numerica):")
print(df_good["Quantità"].sum())

print("\nSomma PrezzoUnitario (notare i NaN dove c'erano celle vuote):")
print(df_good["PrezzoUnitario"].sum())



=== LETTURA CORRETTA (thousands='.', decimal=',') ===

DataFrame letto (corretto):
  Prodotto  PrezzoUnitario  Quantità
0     A123         1234.50    1000.0
1     B777           99.99     250.0
2     C900             NaN       NaN
3     D010         2000.00    1500.0

Tipi di dato dopo lettura corretta:
Prodotto           object
PrezzoUnitario    float64
Quantità          float64
dtype: object

Somma Quantità (ora numerica):
2750.0

Somma PrezzoUnitario (notare i NaN dove c'erano celle vuote):
3334.49


🧾 **5. Header not on the first line** – already seen before

<u>Problem</u>: column names are not read correctly.<br>
<u>Cause</u>: the file has descriptive lines or metadata at the beginning.<br>
<u>Solution</u>:
```python
    pd.read_csv("file.csv", header=2)   # if the header is on the third line
```



In [20]:
# creates CSV file
import csv

file_name = "dati_commentati.csv"

with open(file_name, "w", newline="", encoding="utf-8") as f:
    # writes manually two commented lines
    f.write("# Questo file contiene dati di esempio\n")
    f.write("# Formato: ID,Nome,Valore\n")

    writer = csv.writer(f, delimiter=";")

    # Actual header
    writer.writerow(["ID", "Nome", "Valore"])

    # 3 data rows
    writer.writerow([1, "Alpha", 10.5])
    writer.writerow([2, "Beta", 20.0])
    writer.writerow([3, "Gamma", 7.25])

print(f"Creato file {file_name}")


Creato file dati_commentati.csv


In [21]:
# wrong read
pd.read_csv(file_name)

Unnamed: 0,Unnamed: 1,# Questo file contiene dati di esempio
# Formato: ID,Nome,Valore
ID;Nome;Valore,,
1;Alpha;10.5,,
2;Beta;20.0,,
3;Gamma;7.25,,


In [22]:
# right read
pd.read_csv(
    file_name,
    sep=";",        # column separator
    header=2)       # header is in third row (Python counts from 0)


Unnamed: 0,ID,Nome,Valore
0,1,Alpha,10.5
1,2,Beta,20.0
2,3,Gamma,7.25


In [23]:
# smart read

df = pd.read_csv(
    "dati_commentati.csv",
    sep=";",         # field separator
    comment="#"      # ignores all lines beginning with '#'
)

print(df)
print(df.dtypes)

   ID   Nome  Valore
0   1  Alpha   10.50
1   2   Beta   20.00
2   3  Gamma    7.25
ID          int64
Nome       object
Valore    float64
dtype: object

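When the number of leading comment lines is not known in advance, it can be counted before calling pandas. A sketch using only the standard library; `count_leading_comments` is a hypothetical helper written for this example, and it assumes that the metadata lines start with `#`. The count could then be passed as `pd.read_csv(demo_path, sep=";", skiprows=n_skip)`, which skips that many lines at the start of the file.

```python
import os
import tempfile

def count_leading_comments(path, marker="#", encoding="utf-8"):
    """Count the consecutive lines at the top of the file that start with `marker`."""
    n = 0
    with open(path, encoding=encoding) as f:
        for line in f:
            if line.startswith(marker):
                n += 1
            else:
                break
    return n

# Demo file with the same layout as dati_commentati.csv
demo_path = os.path.join(tempfile.mkdtemp(), "demo_commentato.csv")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("# Questo file contiene dati di esempio\n")
    f.write("# Formato: ID,Nome,Valore\n")
    f.write("ID;Nome;Valore\n")
    f.write("1;Alpha;10.5\n")

n_skip = count_leading_comments(demo_path)
print(n_skip)
```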

🪓 **6. File too large**

<u>Problem</u>: `MemoryError` or very slow loading.<br>
<u>Cause</u>: CSV much bigger than available RAM.<br>
<u>Solution</u>:<br>

**Chunk** loading:
```python
    for chunk in pd.read_csv("file.csv", chunksize=100000):
        process(chunk)                                        # 'process' is a user-defined function
```
---

Let’s see **how *chunks* work** in <u>two parts</u>:

**1. the internal read**:

```python
pd.read_csv(..., chunksize=100000)
```

Normally, the function `pd.read_csv("file.csv")` **reads the entire file into RAM** and returns **a single `DataFrame`**.<br>
With `chunksize=100000`, instead, pandas does NOT load everything.<br>
It returns an **iterator** (a `TextFileReader` object) that yields one `DataFrame` at a time, each with **at most 100,000 rows**.

So:

* first loop → rows 0–99,999
* second loop → rows 100,000–199,999
* third loop → etc.

…until the end of the file.

⚠️ This means that **in memory, at any given time, there is only one chunk of at most 100k rows**, not millions/billions. This is perfect if **the file is too large to fit in RAM**.

**2. the outer loop**:

```python
    for chunk in ... :
```

`chunk` is a “partial” pandas `DataFrame`, i.e. a slice of the CSV.<br>
The `for` loops over all slices of the file, one after the other.



🧪 Let’s look at **a concrete example**, in two steps:

* we define a **function to create the CSV file**, in <u>two variants</u>:

  * the <u>small</u> version (10 rows), useful to inspect by eye
  * the <u>large</u> version (1_000_000 rows), useful for real tests on `chunk`
  * you can choose which version to create by changing only the `n_righe` argument in the call.
* we process it in chunks to **compute the global sum of the `Importo` column**


In [24]:
# STEP 1: definition of a function that creates a big size CSV to test chunk loading

import csv
import random
import datetime

def crea_csv_grande(
    file_name="transazioni_grandi.csv",      # default
    n_righe=1_000_000,                       # default  (the underscores in 1_000_000 only make the number more readable)
    seed=42                                  # default
):
    """
    Creates a CSV with many rows, with the columns:
    ID, DataOperazione, Categoria, Importo

    - ID: progressive integer
    - DataOperazione: fictitious date
    - Categoria: transaction type (e.g. Vendita / Rimborso / Spesa)
    - Importo: positive or negative float
    """

    random.seed(seed)

    categorie = [
        "Vendita",
        "Rimborso",
        "Spesa Marketing",
        "Spesa Fornitore",
        "Abbonamento",
        "Servizio"
    ]

    start_date = datetime.date(2024, 1, 1)

    # Creates the CSV file
    with open(file_name, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter=",")

        # Header
        writer.writerow(["ID", "DataOperazione", "Categoria", "Importo"])

        for i in range(1, n_righe + 1):
            # date = start_date + offset in days
            data_operazione = start_date + datetime.timedelta(days=i % 365)

            categoria = random.choice(categorie)

            # Importo (Amount):
            # - positive sales between 10 and 500
            # - negative refunds between -200 and -5
            # - negative expenses between -1000 and -20
            if categoria == "Vendita" or categoria == "Abbonamento" or categoria == "Servizio":
                importo = round(random.uniform(10, 500), 2)
            elif categoria == "Rimborso":
                importo = round(random.uniform(-200, -5), 2)
            else:
                # Spesa Marketing / Spesa Fornitore
                importo = round(random.uniform(-1000, -20), 2)

            writer.writerow([
                i,
                data_operazione.isoformat(),  # e.g. 2024-03-15
                categoria,
                importo
            ])

    print(f"Creato file CSV '{file_name}' con {n_righe} righe.")

# Example of use of the function (the MAIN)
# `if __name__ == "__main__":` means:
# "run this code block only if this file is executed directly, NOT if it is imported from another file"
#
if __name__ == "__main__":
    # small demo version (to look at it manually)
    crea_csv_grande("transazioni_demo.csv", n_righe=10)

    # large demo version (to test chunksize etc.)
    # ATTENTION: this code creates 1 million rows. Change this number as you prefer.
    crea_csv_grande("transazioni_grandi.csv", n_righe=1_000_000)   # 1_000_000: underscores for readability


Creato file CSV 'transazioni_demo.csv' con 10 righe.
Creato file CSV 'transazioni_grandi.csv' con 1000000 righe.


In [26]:
# STEP 2: process in chunks:

totale = 0.0

for chunk in pd.read_csv("transazioni_grandi.csv", chunksize=100_000):   # 100_000 is the same as 100000
    totale += chunk["Importo"].sum()

print("Total amount:", totale)


Total amount: -59772293.91


The code in the previous cell gives **the global sum of the `Importo` column** without ever loading the whole million rows into RAM at once ✅

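The same pattern extends from a single number to grouped results: each chunk produces a partial aggregate, and the partial aggregates are combined at the end. A minimal sketch on a tiny inline sample (the data and the `chunksize=2` value are illustrative, chosen only to force several chunks):

```python
import io
import pandas as pd

csv_text = (
    "ID,Categoria,Importo\n"
    "1,Vendita,100.0\n"
    "2,Rimborso,-20.0\n"
    "3,Vendita,50.0\n"
    "4,Spesa Fornitore,-300.0\n"
    "5,Rimborso,-5.0\n"
)

partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # one small Series per chunk: sum of Importo for each category seen in it
    partials.append(chunk.groupby("Categoria")["Importo"].sum())

# combine: the same category can appear in several chunks, so we sum again
totals = pd.concat(partials).groupby(level=0).sum()
print(totals)
```

This works because a sum of partial sums equals the global sum. Statistics such as the median do not combine this way and need a different strategy.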

---

**Comment on the NUMERICAL result obtained**<br>

The dataset we created earlier has both inflows and outflows.<br>
In the CSV generator we had this logic:

* Categories like `Vendita`, `Abbonamento`, `Servizio` → **positive** amounts (revenues, +10 to +500)
* Categories like `Rimborso` → **negative** amounts (refunds to the customer, -200 to -5)
* `Spesa Marketing` and `Spesa Fornitore` → **large negative** amounts (costs, from -1000 to -20)

So:

* sales bring money in,
* refunds and expenses pull money out,
* and costs are, on average, larger in absolute value than sales.

Each of the six categories is drawn with the same probability, and the expense categories have a much larger average magnitude than the revenue ones, so the final balance drops hard → hence the very negative total `-59,772,293.91`.

In business terms: **we’re spending more than we’re earning** 😅.

Is that a “normal” result?<br>
**Yes, it’s consistent with the random generation**:

* negative amounts can go down to -1000
* positive amounts go only up to +500

So even if half the rows were sales and half expenses, the expense side would still win in absolute value.

---
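The size of the total can be checked with a quick back-of-the-envelope computation, independent of pandas. The sketch assumes only what the generator above states: amounts drawn with `random.uniform` and categories chosen with equal probability.

```python
# Expected value of one row, from the rules of the generator:
# random.uniform(a, b) has mean (a + b) / 2, and the six categories are equally likely.
mean_revenue = (10 + 500) / 2        # Vendita, Abbonamento, Servizio
mean_refund = (-200 + -5) / 2        # Rimborso
mean_expense = (-1000 + -20) / 2     # Spesa Marketing, Spesa Fornitore

expected_per_row = (3 * mean_revenue + 1 * mean_refund + 2 * mean_expense) / 6
expected_total = expected_per_row * 1_000_000

print(round(expected_per_row, 2))    # about -59.58
print(round(expected_total))         # about -59.6 million
```

The observed total of about -59.8 million is within a fraction of a percent of this estimate, which is what one would expect from one million random draws.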


🧮 **7. Columns with missing or misaligned values**

<u>Problem</u>: rows with different number of columns, error like `ParserError: Error tokenizing data`.<br>
<u>Cause</u>: unclosed quotes or separators inside fields.<br>
<u>Solution</u>:
```python
    pd.read_csv("file.csv", on_bad_lines="skip", quoting=csv.QUOTE_NONE)
```

Or check the delimiters.

Below is the code in 3 steps (for each of the two errors):
- creation of the csv file
- wrong (intentional) reading
- robust reading

In [27]:
# -----------------------------------
# 1) CSV with non closed quotes     -
# -----------------------------------
content_unclosed = """id,name,amount,notes
1,Mario Rossi,1200,OK
2,Luigi Bianchi,950,pagato
3,Carla Verdi,800,"nota con virgolette non chiuse
4,Paolo Neri,700,ok
"""

file_unclosed = "csv_virgolette_non_chiuse.csv"
with open(file_unclosed, "w", encoding="utf-8", newline="") as f:
    f.write(content_unclosed)

# ----------------------------------------------------
# 2) CSV with separators within fields (non quoted)  -
# ----------------------------------------------------
# header: 4 columns
content_separators = """id,name,city,amount
1,Mario Rossi,Milano,1200
2,Luigi Bianchi,Roma,900
3,Carla Verdi,Milano, Italia,800
4,Paolo Neri,Torino,700
"""

file_separators = "csv_separatori_dentro_campi.csv"
with open(file_separators, "w", encoding="utf-8", newline="") as f:
    f.write(content_separators)

print("✅ 2 test csv files created")


✅ 2 test csv files created


In [28]:
# ==========================
# Read test of file 1      =
# ==========================

print("\n=== 1) TEST: virgolette non chiuse ===")
try:
    df1 = pd.read_csv(file_unclosed)
    print(df1)
except Exception as e:
    print("❌ errore atteso (virgolette non chiuse):")
    print(e)



=== 1) TEST: virgolette non chiuse ===
❌ errore atteso (virgolette non chiuse):
Error tokenizing data. C error: EOF inside string starting at row 3


In [29]:
# ===============================
# 'robust' read of file 1       =
# ===============================
df1_ok = pd.read_csv(
    file_unclosed,
    on_bad_lines="skip",
    quoting=csv.QUOTE_NONE,
    engine="python",
)
print("\n✅ lettura robusta (file 1):")
print(df1_ok)


✅ lettura robusta (file 1):
   id           name  amount                            notes
0   1    Mario Rossi    1200                               OK
1   2  Luigi Bianchi     950                           pagato
2   3    Carla Verdi     800  "nota con virgolette non chiuse
3   4     Paolo Neri     700                               ok


`quoting=csv.QUOTE_NONE`:<br>
this tells pandas: “do not treat quotes (") as something special, consider them normal text”.

Why did we also put `engine="python"`?<br>
...

In [30]:
# ==========================
# read test of file 2      =
# ==========================

print("\n=== 2) TEST: separatori dentro i campi ===")
try:
    df2 = pd.read_csv(file_separators)
    print(df2)
except Exception as e:
    print("❌ errore atteso (troppi separatori):")
    print(e)


=== 2) TEST: separatori dentro i campi ===
❌ errore atteso (troppi separatori):
Error tokenizing data. C error: Expected 4 fields in line 4, saw 5



In [31]:
# ===============================
# 'robust' read of file 2       =
# ===============================

df2_ok = pd.read_csv(
    file_separators,
    on_bad_lines="skip",
    quoting=csv.QUOTE_NONE,
    engine="python",
)
print("\n✅ lettura robusta (file 2):")
print(df2_ok)


✅ lettura robusta (file 2):
   id           name    city  amount
0   1    Mario Rossi  Milano    1200
1   2  Luigi Bianchi    Roma     900
2   4     Paolo Neri  Torino     700

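Note that `on_bad_lines="skip"` made the read succeed by dropping the row of Carla Verdi, so data was lost. When the defect follows a known pattern, an alternative is to pass a function to `on_bad_lines` (supported from pandas 1.4, only with `engine="python"`). The sketch below is one possible repair, not a general solution: `merge_extra_city_fields` is a hypothetical helper, and it assumes that any extra field belongs to the `city` column.

```python
import io
import pandas as pd

csv_text = (
    "id,name,city,amount\n"
    "1,Mario Rossi,Milano,1200\n"
    "3,Carla Verdi,Milano, Italia,800\n"
    "4,Paolo Neri,Torino,700\n"
)

def merge_extra_city_fields(bad_line):
    """Hypothetical repair rule: everything between 'name' and 'amount' belongs to 'city'."""
    city = ",".join(bad_line[2:-1])
    return [bad_line[0], bad_line[1], city, bad_line[-1]]

df_fixed = pd.read_csv(
    io.StringIO(csv_text),
    on_bad_lines=merge_extra_city_fields,   # callable: needs pandas >= 1.4 and engine="python"
    engine="python",
)
print(df_fixed)
```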

🧠 **8. Dates not interpreted correctly**

<u>Problem</u>: dates remain strings or are in the wrong format.<br>
<u>Solution</u>:
```python
pd.read_csv("file.csv", parse_dates=["data"])
```

or:
```python
df["data"] = pd.to_datetime(df["data"], dayfirst=True)
```

Let's create a CSV that deliberately contains mixed dates (Italian, ISO, US, with hours) so that `read_csv` does not recognize them and leaves them as `object`.

In [32]:
# 1. Create a CSV file with "bad" dates
# "bad" CSV: different date formats → pandas cannot make them uniform

from datetime import datetime

csv_text = """data;descrizione;importo
01/02/2025;Fattura cliente A;120.50
2025-02-01;Fattura cliente B;85.00
12/31/2024;Formato USA;15.75
31/12/2024;Chiusura anno;999.99
2025/02/01 14:30;Con orario;50.00
;Data mancante;0.00
non-data;Valore sporco;5.25
"""

with open("date_mischiate.csv", "w", encoding="utf-8", newline="") as f:
    f.write(csv_text)

print("✅ creato date_mischiate.csv\n")

✅ creato date_mischiate.csv



It’s a CSV file with mixed dates.

Now let’s do the “naive” read (pandas doesn’t understand the dates and leaves them as `object`):


In [33]:
# 2. "wrong" / naive reading
df_raw = pd.read_csv("date_mischiate.csv", sep=";")
print("🔴 LETTURA INGENUA")
print(df_raw.dtypes)
print(df_raw, "\n")

🔴 LETTURA INGENUA
data            object
descrizione     object
importo        float64
dtype: object
               data        descrizione  importo
0        01/02/2025  Fattura cliente A   120.50
1        2025-02-01  Fattura cliente B    85.00
2        12/31/2024        Formato USA    15.75
3        31/12/2024      Chiusura anno   999.99
4  2025/02/01 14:30         Con orario    50.00
5               NaN      Data mancante     0.00
6          non-data      Valore sporco     5.25 



`data` is `object` → that is, a string.<br>
Now the two solutions:

Solution 1 – directly in `read_csv`


In [34]:
# =========================================
# 3) Read with parse_dates
# → with these mixed dates, pandas gives up!
# → the 'data' column remains object
# =========================================
df_auto = pd.read_csv(
    "date_mischiate.csv",
    sep=";",
    parse_dates=["data"],
    dayfirst=True
)
print("🟠 LETTURA CON parse_dates (pandas non ce la fa)")
print(df_auto.dtypes)
print(df_auto, "\n")
# 👉 'data' = object
# because in the file there are: dd/mm/yyyy, ISO, mm/dd/yyyy, with time, empty values, text, etc...


🟠 LETTURA CON parse_dates (pandas non ce la fa)
data            object
descrizione     object
importo        float64
dtype: object
               data        descrizione  importo
0        01/02/2025  Fattura cliente A   120.50
1        2025-02-01  Fattura cliente B    85.00
2        12/31/2024        Formato USA    15.75
3        31/12/2024      Chiusura anno   999.99
4  2025/02/01 14:30         Con orario    50.00
5               NaN      Data mancante     0.00
6          non-data      Valore sporco     5.25 



In [35]:
# =========================================
# 4) ROBUST read (it must work!)
# → first we read the column as strings
# → then we convert the values one by one
# =========================================

# definition of a "flexible" parsing function
def parse_flessibile(x: str):
    if pd.isna(x) or x == "":
        return pd.NaT
    # try several known formats, in order
    for fmt in ("%d/%m/%Y", "%Y-%m-%d", "%m/%d/%Y", "%Y/%m/%d %H:%M", "%d/%m/%Y %H:%M"):
        try:
            return datetime.strptime(x, fmt)
        except ValueError:
            continue
    return pd.NaT   # not a date in any known format

# apply the function to every value of the "data" column
df_ok = pd.read_csv("date_mischiate.csv", sep=";")
df_ok["data"] = df_ok["data"].apply(parse_flessibile)

print("🟢 ROBUST read (after apply)")
print(df_ok.dtypes)
print(df_ok)

🟢 ROBUST read (after apply)
data           datetime64[ns]
descrizione            object
importo               float64
dtype: object
                 data        descrizione  importo
0 2025-02-01 00:00:00  Fattura cliente A   120.50
1 2025-02-01 00:00:00  Fattura cliente B    85.00
2 2024-12-31 00:00:00        Formato USA    15.75
3 2024-12-31 00:00:00      Chiusura anno   999.99
4 2025-02-01 14:30:00         Con orario    50.00
5                 NaT      Data mancante     0.00
6                 NaT      Valore sporco     5.25


Let's see **line by line** what the code in the previous cell does:

1. `def parse_flessibile(x: str):`<br>
defines a function that receives **a single cell** (a string) and returns a **date** or `NaT`.

2. `if pd.isna(x) or x == "":`<br>
if the cell is empty (`""`) or it is an NA (`NaN` read by pandas) → it doesn't even try → **returns** `pd.NaT`.<br>
(`pd.NaT` = Not a Time, the equivalent of `NaN` but for dates.)

3. `for fmt in (...):`<br>
here there is the list of the **formats it wants to try**, in order:
    - `"%d/%m/%Y"` → 31/12/2024 (Italian)
    - `"%Y-%m-%d"` → 2025-02-01 (ISO)
    - `"%m/%d/%Y"` → 12/31/2024 (US)
    - `"%Y/%m/%d %H:%M"` → 2025/02/01 14:30
    - `"%d/%m/%Y %H:%M"` → 31/12/2024 09:15

    if one of the formats works → it exits the `for` with `return datetime.strptime(...)`.

4. final `return pd.NaT`<br>
if none of the formats worked → it returns `NaT`.

5. `df_ok = pd.read_csv(...)`<br>
here we read the CSV without any date parsing, so the `data` column arrives **as text** (`object`).

6. `df_ok["data"] = df_ok["data"].apply(parse_flessibile)`<br>
here we let the function do the job: row by row → cell by cell → it tries all the formats.

At the end, the `data` column is a **datetime** column 🟢

---

**Why do we need `apply`?**

```python
df_ok = pd.read_csv("date_mischiate.csv", sep=";")
df_ok["data"] = df_ok["data"].apply(parse_flessibile)
```

* `read_csv(...)` leaves the whole `data` column as strings (because the formats were all different).
* `df_ok["data"].apply(parse_flessibile)` means:<br>
  “for **each row** of the `data` column execute `parse_flessibile(...)`”.
* the result is **a new Series of type `datetime`** (with some `NaT` inside).
* it reassigns it to `df_ok["data"]` → the column really becomes `datetime64[ns]`.

It’s the classic trick: **when `parse_dates` is not enough** → you do the `apply`.

---

**Why didn’t we just use `pd.to_datetime(...)`?**

We could have done:

```python
df_ok["data"] = pd.to_datetime(df_ok["data"], dayfirst=True, errors="coerce")
```

and in many cases it’s fine.<br>
Here, however, we had both Italian and American formats, both with time and without. `to_datetime` alone sometimes guesses right, sometimes not.<br>
With this parsing function, instead, we decide the order of the formats.<br>
Example: first try Italian, then American → this way an ambiguous date like 03/04/2025 is always read as 3 April, never as 4 March.

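The ambiguity is easy to see with the standard library alone. A minimal sketch: the same string is a valid date in both conventions, so only the order in which the formats are tried decides the result.

```python
from datetime import datetime

ambiguous = "03/04/2025"

as_italian = datetime.strptime(ambiguous, "%d/%m/%Y")    # day first
as_american = datetime.strptime(ambiguous, "%m/%d/%Y")   # month first

print(as_italian.date())    # 2025-04-03 (3 April)
print(as_american.date())   # 2025-03-04 (4 March)
```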
---

**What happens to the ‚Äúdirty‚Äù rows?**

* empty cell (read by pandas as `NaN`) → `NaT`
* `"non-data"` → no format understands it → `NaT`
* `"2025/02/01 14:30"` → it catches it at the 4th format → becomes a real `datetime`
* `"12/31/2024"` → it catches it at the 3rd format → ok
* `"31/12/2024"` → it catches it at the 1st format → ok

---

```python
print(df_ok.dtypes)
```

data           datetime64[ns]<br>
descrizione            object<br>
importo               float64<br>
dtype: object<br>

This is the result we wanted from the very beginning: the column is no longer `object`, it is a column of dates 💪

---



**FINAL SUMMARY of point 8**

If the dates are **all in the same format** →<br>
`pd.read_csv(..., parse_dates=["data"])` is perfectly fine.

If the dates have **different but “similar” formats** (all European, or all ISO) →<br>
you can read normally and then do:

```python
df["data"] = pd.to_datetime(df["data"], dayfirst=True, errors="coerce")
```

often that's enough.

If the dates are really **heterogeneous** (eu, us, with time, empty, text) →<br>
`parse_dates` alone is not enough, and sometimes not even a guessed `to_datetime(...)`;<br>
in that case a **flexible parsing function** that tries multiple formats is convenient.

**Practical rule:**<br>
1 format → `parse_dates`<br>
few formats → `to_datetime`<br>
mixed/dirty formats → **custom function** ✅

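On large columns, calling a Python function once per cell with `apply` can be slow. A possible vectorized variant of the same idea is sketched below: each format is applied to the whole column with `pd.to_datetime(..., format=..., errors="coerce")`, and the gaps left by one format are filled by the next. The priority order is the same as in `parse_flessibile`; the inline Series is a copy of the `data` column, and this is an illustrative alternative, not the notebook's own method.

```python
import pandas as pd

raw = pd.Series([
    "01/02/2025", "2025-02-01", "12/31/2024",
    "31/12/2024", "2025/02/01 14:30", None, "non-data",
])

formats = ["%d/%m/%Y", "%Y-%m-%d", "%m/%d/%Y", "%Y/%m/%d %H:%M"]

# Start with the first format; values that do not match it become NaT
parsed = pd.to_datetime(raw, format=formats[0], errors="coerce")

# Fill the remaining gaps with the next formats, in priority order
for fmt in formats[1:]:
    parsed = parsed.fillna(pd.to_datetime(raw, format=fmt, errors="coerce"))

print(parsed)
```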

⬜ **9. Duplicates or whitespace in column names**

<u>Problem</u>: names with spaces or duplicates (`'Nome '` ≠ `'Nome'`).<br> <u>Solution</u>:

```python
    df.columns = df.columns.str.strip()
```


In [36]:
# 1) I create a CSV with column names with spaces
csv_text = """  data  ;  nome cliente ; importo ;  note
01/02/2025;Mario Rossi;120.50;pagato
02/02/2025;  Anna Bianchi ;89.00;ritardo
03/02/2025;ACME S.p.A.;250.00;
"""

file_name = "con_spazi.csv"
with open(file_name, "w", encoding="utf-8", newline="") as f:
    f.write(csv_text)

print(f"✅ creato {file_name}")


✅ creato con_spazi.csv


In [37]:
# 2) "normal" read → column names are dirty
df = pd.read_csv(file_name, sep=";")
print("🔎 colonne lette (sporche):")
print(repr(df.columns.tolist()))


🔎 colonne lette (sporche):
['  data  ', '  nome cliente ', ' importo ', '  note']


In [38]:
# 3) cleans column names
df.columns = df.columns.str.strip()

print("\n✅ colonne dopo strip():")
print(repr(df.columns.tolist()))

print("\n📄 dataframe finale:")
print(df)


✅ colonne dopo strip():
['data', 'nome cliente', 'importo', 'note']

📄 dataframe finale:
         data     nome cliente  importo     note
0  01/02/2025      Mario Rossi    120.5   pagato
1  02/02/2025    Anna Bianchi      89.0  ritardo
2  03/02/2025      ACME S.p.A.    250.0      NaN


**What happened?**

* before `strip()` the columns are like:<br>
  *['  data  ', '  nome cliente ', ' importo ', '  note']*<br>
* after:<br>
  *['data', 'nome cliente', 'importo', 'note']*<br>

So if you want to do:

```python
  df["data"]
```

now it works, while before you would have had to write `df["  data  "]` and that’s not nice 😅

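The title of this point also mentions duplicates. Two names that differ only by surrounding spaces are distinct for pandas until they are stripped, and identical afterwards, so it is worth checking for collisions right after `strip()`. A minimal sketch with a hypothetical header, invented for this example; if the resulting list is not empty, the columns should be renamed before selecting by name, since `df["Nome"]` would return both columns.

```python
import io
import pandas as pd

# Hypothetical header where two names differ only by surrounding spaces
csv_text = " Nome ;Nome;importo \nMario;Rossi;120.50\n"

df_dup = pd.read_csv(io.StringIO(csv_text), sep=";")
df_dup.columns = df_dup.columns.str.strip()

# After stripping, the two 'Nome' columns collide
duplicated_names = df_dup.columns[df_dup.columns.duplicated()].tolist()
print(duplicated_names)
```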

🧱 **10. Quotes and special characters**

<u>Problem</u>: CSV with inner quotes, double quotes, etc.<br> <u>Solution</u>:

```python
    pd.read_csv("file.csv", quotechar='"', escapechar='\\')
```


Let’s look at the usual flow (like for the previous errors), this time for two files (case A and case B):

* we create a CSV that is **deliberately dirty** (with quotes inside, doubled quotes, backslashes…): in case A it is not dirty enough to stop the read, in case B it is
* we try the **naive read** → in case A it goes through, in case B it errors out
* we do the **robust read** → `quotechar='"'`, `escapechar='\\'`, and optionally also `engine="python"` to be safe.


In [39]:
from pathlib import Path

# =========================================================
# CASE A - CSV "dirty but readable"
# ---------------------------------------------------------
# Here we want to show that: even if there are inner quotes
# and backslashes, the naive read *might* still work.
# =========================================================

file_ok = "csv_sporco_ok.csv"

csv_ok = (
    # row 1
    'id,descrizione,note\n'
    # row 2
    '1,"Martello, 500g","tutto ok"\n'
    # row 3 - here there is the backslash + quotes: \"piatto\"
    '2,"Cacciavite \\"piatto\\"","virgolette con backslash (\\")"\n'
    # row 4 - doubled quotes, CSV style
    '3,"Set ""professionale"" 24 pz","virgolette raddoppiate nel campo descrizione"\n'
    # row 5 - Windows path
    '4,"C:\\\\attrezzi\\\\nuovo","percorso Windows con backslash"\n'
)

Path(file_ok).write_text(csv_ok, encoding="utf-8")
print(f"✅ Scritto {file_ok}")
print(csv_ok)

print("\n=== CASO A - LETTURA INGENUA (funziona) ===")
# 👉 HERE we do *NOT* set either quotechar or escapechar
df_ok_naive = pd.read_csv(file_ok)
print(df_ok_naive)


✅ Scritto csv_sporco_ok.csv
id,descrizione,note
1,"Martello, 500g","tutto ok"
2,"Cacciavite \"piatto\"","virgolette con backslash (\")"
3,"Set ""professionale"" 24 pz","virgolette raddoppiate nel campo descrizione"
4,"C:\\attrezzi\\nuovo","percorso Windows con backslash"


=== CASO A - LETTURA INGENUA (funziona) ===
   id                descrizione                                          note
0   1             Martello, 500g                                      tutto ok
1   2      Cacciavite \piatto\""                 virgolette con backslash (\)"
2   3  Set "professionale" 24 pz  virgolette raddoppiate nel campo descrizione
3   4        C:\\attrezzi\\nuovo                percorso Windows con backslash


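As you can see:

* the naive read of the `csv_sporco_ok.csv` file **did not raise an error** (even without `quotechar` + `escapechar`)
* but “no error” is not the same as “correct”: in row 2 the backslashes are kept and the quotes end up in the wrong place (`Cacciavite \piatto\""`), which is why it’s better to specify `quotechar` + `escapechar`

The robust read in the following cell returns the intended text, `Cacciavite "piatto"`. Note that `escapechar='\\'` also consumes one backslash of every doubled pair, so the Windows path is read as `C:\attrezzi\nuovo`: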


In [40]:
print("\n=== CASE A - ROBUST READ ===")
df_ok_safe = pd.read_csv(
    file_ok,
    quotechar='"',
    escapechar='\\',
    engine="python",
)
print(df_ok_safe)


=== CASE A - ROBUST READ ===
   id                descrizione                                          note
0   1             Martello, 500g                                      tutto ok
1   2        Cacciavite "piatto"                  virgolette con backslash (")
2   3  Set "professionale" 24 pz  virgolette raddoppiate nel campo descrizione
3   4          C:\attrezzi\nuovo                percorso Windows con backslash


Let’s now create, instead, a CSV file “broken” in another way:

In [41]:
# =========================================================
# CASE B - CSV BROKEN ON PURPOSE (unbalanced quotes)
# ---------------------------------------------------------
# Here we want to show the case in which the
# naive read does NOT work.
# The idea is to put a line with: "Pliers with "quotes" inside"
# but WITHOUT escape and with a comma inside → the C parser chokes.
# =========================================================

file_bad = "csv_sporco_rotto.csv"

csv_bad = (
    'id,descrizione,prezzo\n'                    # row 1 (header)
    '1,"Martello",12.5\n'                        # row 2 ok
    '2,"Cacciavite \\"piatto\\"",8.9\n'          # row 3 ok (has \")
    '3,"Set ""professionale"" 24 pz",49.0\n'     # row 4 ok (has "")
    # row 5 - THIS BREAKS IT:
    #   - quoted field that contains other NON-escaped quotes
    #   - and it also contains a comma → the parser thinks another field starts/ends
    '4,"Pinza con "virgolette" dentro, con virgola",15.0\n'
)

Path(file_bad).write_text(csv_bad, encoding="utf-8")
print(f"\n✅ Scritto {file_bad}")
print(csv_bad)


print("\n=== CASO B - LETTURA INGENUA (DEVE FALLIRE) ===")
try:
    df_bad_naive = pd.read_csv(file_bad)
    print(df_bad_naive)
except Exception as e:
    # here you expect something like:
    # ParserError: Error tokenizing data. C error: Expected 3 fields in line 5, saw 4
    print("❌ Lettura ingenua fallita:", e)



✅ Scritto csv_sporco_rotto.csv
id,descrizione,prezzo
1,"Martello",12.5
2,"Cacciavite \"piatto\"",8.9
3,"Set ""professionale"" 24 pz",49.0
4,"Pinza con "virgolette" dentro, con virgola",15.0


=== CASO B - LETTURA INGENUA (DEVE FALLIRE) ===
❌ Lettura ingenua fallita: Error tokenizing data. C error: Expected 3 fields in line 5, saw 4



With the two arguments (`quotechar='"'` and `escapechar='\\'`) plus `on_bad_lines="warn"`, the robust read instead **does not raise**: it loads the valid rows and skips the broken one.

In [42]:
print("\n=== CASO B - LETTURA ROBUSTA  ===")
df_bad_safe = pd.read_csv(
    file_bad,
    quotechar='"',
    escapechar='\\',
    engine="python",
    on_bad_lines="warn",   # or "skip" to skip the broken rows
)
print(df_bad_safe)


=== CASO B - LETTURA ROBUSTA  ===
   id                descrizione  prezzo
0   1                   Martello    12.5
1   2        Cacciavite "piatto"     8.9
2   3  Set "professionale" 24 pz    49.0



  df_bad_safe = pd.read_csv(


It read the file, but it emitted a **warning**!<br>
The warning tells us something very precise: even with the “more tolerant” parser (`engine="python"`) and even with `quotechar` / `escapechar`, line 5 is really broken at the CSV level. It’s not just “hard”, it is syntactically wrong: the inner quotes are neither doubled nor escaped, so the field is closed too early, and the comma that follows splits it into an extra field.

So it’s normal that pandas skips it. We built the row like this on purpose to show the case in which “not even the robust read can save it” and pandas says “ok, I throw it away and go on”.

There are **two ways** to avoid the warning (and the skipped row):


<u>First way</u>: standard **CSV style** → **we double the quotes**:

In [43]:
import pandas as pd
from pathlib import Path

file_good = "csv_sporco_riparato_doppie.csv"

csv_good = (
    'id,descrizione,prezzo\n'
    '1,"Martello",12.5\n'
    '2,"Cacciavite \\"piatto\\"",8.9\n'
    '3,"Set ""professionale"" 24 pz",49.0\n'
    # 👇 correct here: internal quotes are doubled
    '4,"Pinza con ""virgolette"" dentro, con virgola",15.0\n'
)

Path(file_good).write_text(csv_good, encoding="utf-8")

print("\n=== LETTURA ROBUSTA (CSV VALIDO, virgolette raddoppiate) ===")
df_ok = pd.read_csv(
    file_good,
    quotechar='"',
    escapechar='\\',
    engine="python",
)
print(df_ok)


=== LETTURA ROBUSTA (CSV VALIDO, virgolette raddoppiate) ===
   id                                 descrizione  prezzo
0   1                                    Martello    12.5
1   2                         Cacciavite "piatto"     8.9
2   3                   Set "professionale" 24 pz    49.0
3   4  Pinza con "virgolette" dentro, con virgola    15.0


There is no warning anymore! Why?<br> Because:

* the field is quoted `"..."`,
* inside there are quotes → we rewrote them as `""`,
* inside there is also the comma → but since the field is quoted, the comma is ok.
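The same rule can be seen from the writing side. The following is a minimal sketch, assuming the file is produced from Python: the standard `csv` module doubles inner quotes automatically when it writes a field. The row content is reused from the example above purely as an illustration.

```python
import csv
import io

# Row with inner quotes and a comma in the description (illustrative data)
row = [4, 'Pinza con "virgolette" dentro, con virgola', 15.0]

buffer = io.StringIO()
writer = csv.writer(buffer)   # default dialect: quotechar='"', doublequote=True
writer.writerow(["id", "descrizione", "prezzo"])
writer.writerow(row)

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back returns the original description, quotes and comma included
parsed = list(csv.reader(io.StringIO(csv_text)))
print(parsed[1][1])
```

Writing with `csv.writer` (or with `DataFrame.to_csv`, which follows the same quoting convention by default) avoids producing rows like the broken one in Case B.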


<u>Second way</u>: **“`escapechar`”** style → we use the backslash inside (if we really want to show the use of `escapechar='\\'`):

In [44]:
from pathlib import Path

file_good2 = "csv_sporco_riparato_escape.csv"

csv_good2 = (
    'id,descrizione,prezzo\n'
    '1,"Martello",12.5\n'
    '2,"Cacciavite \\"piatto\\"",8.9\n'
    '3,"Set ""professionale"" 24 pz",49.0\n'
    # 👇 here we ESCAPE ALL internal quotes
    '4,"Pinza con \\"virgolette\\" dentro, con virgola",15.0\n'
)

Path(file_good2).write_text(csv_good2, encoding="utf-8")

print("\n=== LETTURA ROBUSTA (CSV VALIDO, escape con backslash) ===")
df_ok2 = pd.read_csv(
    file_good2,
    quotechar='"',
    escapechar='\\',   # 👈 now it's really required
    engine="python",
)
print(df_ok2)


=== LETTURA ROBUSTA (CSV VALIDO, escape con backslash) ===
   id                                 descrizione  prezzo
0   1                                    Martello    12.5
1   2                         Cacciavite "piatto"     8.9
2   3                   Set "professionale" 24 pz    49.0
3   4  Pinza con "virgolette" dentro, con virgola    15.0


As you can see, again: no warning ✅

**SUMMARY of point 10**:

* “naive read”: `pd.read_csv("file.csv")` → if the CSV is formally correct, it reads; if it’s a bit dirty, sometimes it reads (possibly with mangled values); if it’s really broken, it raises `ParserError`.
* “robust read”: `pd.read_csv("file.csv", quotechar='"', escapechar='\\', engine="python", on_bad_lines="warn")` → **it reads more, but it cannot invent quotes that aren’t there** → and so it gives the **warning** we saw before and skips the row.
* if we do NOT want the warning → we rewrite the CSV in a valid way, using one of the two styles we saw (doubling `""` or escaping `\"`).
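When rows are skipped, it is often useful to keep them for inspection instead of only seeing a warning. A minimal sketch, assuming pandas ≥ 1.4 (where `on_bad_lines` also accepts a callable, but only with `engine="python"`); the data and the helper name `collect_bad_line` are illustrative:

```python
import io
import pandas as pd

# Small in-memory CSV: the third data row has one field too many (illustrative data)
csv_text = (
    "id,descrizione,prezzo\n"
    "1,Martello,12.5\n"
    "2,Cacciavite,8.9\n"
    "3,Pinza,rotta,15.0\n"
)

bad_rows = []  # rejected rows end up here, already split into fields

def collect_bad_line(fields):
    """Record the broken row and return None so that pandas skips it."""
    bad_rows.append(fields)
    return None

df_checked = pd.read_csv(
    io.StringIO(csv_text),
    engine="python",               # a callable is accepted only by the python engine
    on_bad_lines=collect_bad_line,
)

print(df_checked)
print("Rejected rows:", bad_rows)
```

The callable is invoked for rows with too many fields; the collected rows can then be repaired by hand or written to a separate file.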


**Summary of the 10 cases**

| Problem | Main cause | Pandas error / symptom example | Suggested solution |
|--------|------------|---------------------------------|--------------------|
| Column `Unnamed: 0` | The index was saved in the CSV (from `to_csv(index=True)`) | No error, but an extra column `"Unnamed: 0"` appears | `pd.read_csv("file.csv", index_col=0)` or `df.drop(columns=["Unnamed: 0"])` |
| All columns merged into one | Wrong separator (`;`, `,`, `\t`, etc.) | No error, but `df.shape` shows only 1 column | `pd.read_csv("file.csv", sep=";")` or `sep="\t"` |
| Strange characters (ÔøΩ) | Wrong encoding (UTF-8 vs Latin1 vs CP1252) | `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0` | `pd.read_csv("file.csv", encoding="latin1")` or `encoding="cp1252"` |
| Numeric values read as strings | Decimal/thousands separators or symbols | No error, but `df.dtypes` shows `object` | `pd.read_csv("file.csv", decimal=",", thousands=".")` or `pd.to_numeric(..., errors="coerce")` |
| Header not on the first line | Comment rows or metadata before the header | No error, but the column names are wrong (taken from the first metadata row, or `Unnamed: 1`, `Unnamed: 2`, …) | `pd.read_csv("file.csv", header=2)` or `skiprows=2` |
| File too large / not enough memory | File > available RAM | `MemoryError` or kernel crash | `pd.read_csv(..., chunksize=100000)` or use Dask/Polars |
| Rows with different number of columns | Unclosed quotes or irregular delimiters | `ParserError: Error tokenizing data. C error: Expected N fields in line X, saw M` | `pd.read_csv(..., on_bad_lines="skip", engine="python")` |
| Dates not recognized | Non-standard format or day/month ambiguity | No error, but `object` instead of `datetime64` | `pd.read_csv(..., parse_dates=["Data1"])` or `pd.to_datetime(..., dayfirst=True)` |
| Column names with spaces / duplicates | Header with spaces, double tabs, or invisible characters | No error, but `KeyError` when accessing the name | `df.columns = df.columns.str.strip()` or `rename` |
| Complex quotes / escapes | Fields with double quotes or inner separators | `ParserError: unexpected end of data` or partial parsing | `pd.read_csv(..., quotechar='"', escapechar='\\')` |
| Unknown separator | Non-standard file, mixed `, ; \t` | `ParserError` or wrong columns | `pd.read_csv("file.csv", sep=None, engine="python")` |
| Slow performance | `"python"` engine and large file | Very slow loading | `engine="c"` or `chunksize` |
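As an illustration of one row of the table (numeric values read as strings), here is a minimal sketch with an in-memory file; the column names and values are invented for the example:

```python
import io
import pandas as pd

# European-style numbers: dot for thousands, comma for decimals (illustrative data)
csv_text = (
    "prodotto;prezzo\n"
    "Martello;1.250,50\n"
    "Cacciavite;8,90\n"
)

# Without hints the price column is not recognized as numeric
df_raw = pd.read_csv(io.StringIO(csv_text), sep=";")

# With decimal= and thousands= the same column is parsed as float
df_num = pd.read_csv(io.StringIO(csv_text), sep=";", decimal=",", thousands=".")

print(df_raw.dtypes)
print(df_num.dtypes)
print(df_num)
```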


**A second set of examples follows** with **data correction**:<br>

1️⃣ creation of a **“dirty”** CSV file with various **real errors and inconsistencies**;<br>
2️⃣ the full Python code to read it correctly with `pandas.read_csv()`;<br>
3️⃣ **<u>the Python code to clean the dataframe</u>** ❗


In [45]:
# =========================
# 1. CREATE THE "DIRTY" CSV FILE
# =========================

csv_content = """# Example data exported from a legacy system
# Contains errors in formats, encodings and separators
ID; Nome ; Età ; Data_nascita ; Stipendio ; Note
0; "Mario Rossi"; 35 ; 12/05/1989 ; "2.500,50" ; "Lavora a Roma, ottimo rendimento"
1; "Anna Bianchi"; 29 ; 01/09/1995 ; "3.200,00" ; "Milano, nuovi progetti"
2; "José Álvarez"; 40 ; 15/02/1984 ; "4.000,75" ; "Problemi di encoding àèìòù"
3; "Luigi Verdi"; "?" ; 03/11/1990 ; "2,800.00" ; "Errore nei separatori decimali"
4; "Giulia Rossi" ; 27 ; 31-08-1997 ; "3.000,00" ; "Riga OK"
5; "Paolo Bianchi" ; 33 ; 02/04/1991 ; "N/A" ; "Valore mancante stipendio"
6; "Marco, Test"; 38 ; 07/07/1986 ; "2.900,00" ; "Virgola nel nome"
7 "Sara Neri" ; 31 ; 10/10/1993 ; "3.200,00" ; Riga con separatore mancante
8; "Laura Verdi"; 25 ; 21/06/1999 ; "3.000,00"
9; "Andrea Neri" ; ; ; ; "Campi mancanti"
Unnamed: 0; "Extra colonna inutile"; ; ; ;
"""

with open("dati_sporchi.csv", "w", encoding="latin1") as f:
    f.write(csv_content)

print("✅ File 'dati_sporchi.csv' creato.\n")


✅ File 'dati_sporchi.csv' creato.



2️⃣ Robust read:

In [46]:
# =========================
# 2. ROBUST READING
# =========================
df = pd.read_csv(
    "dati_sporchi.csv",
    sep=";",                     # European separator
    comment="#",                 # ignore comment lines
    engine="python",             # more flexible parser
    encoding="latin1",           # handle accents
    on_bad_lines="skip",         # skip wrong rows
    skip_blank_lines=True,       # ignore empty lines
    skipinitialspace=True        # remove spaces after the delimiter
)

print("Original columns:", df.columns.tolist(), "\n")


Original columns: ['ID', 'Nome ', 'Età ', 'Data_nascita ', 'Stipendio ', 'Note'] 



In [47]:
df.head()

Unnamed: 0,ID,Nome,Età,Data_nascita,Stipendio,Note
0,8,Laura Verdi,25.0,21/06/1999,"3.000,00",
1,Unnamed: 0,Extra colonna inutile,,,,


3️⃣ Cleaning the dataframe:

In [48]:
# =========================
# 3.1 COLUMN NAME CLEANING
# =========================
df.columns = df.columns.str.strip()                               # remove spaces
df.columns = df.columns.str.replace("Ã", "à", regex=False)         # fix wrong accents
df = df.loc[:, ~df.columns.str.contains("^Unnamed", case=False)]  # remove Unnamed columns

In [49]:
df.head()

Unnamed: 0,ID,Nome,Età,Data_nascita,Stipendio,Note
0,8,Laura Verdi,25.0,21/06/1999,"3.000,00",
1,Unnamed: 0,Extra colonna inutile,,,,


In [50]:
# =========================
# 3.2 TYPICAL TRANSFORMATIONS
# =========================

# -- "Età" column
if "Età" in df.columns:
    df["Età"] = pd.to_numeric(df["Età"], errors="coerce")

# -- "Stipendio" column
if "Stipendio" in df.columns:
    df["Stipendio"] = (
        df["Stipendio"]
        .astype(str)
        .str.replace(".", "", regex=False)  # remove dots (thousands)
        .str.replace(",", ".", regex=False) # convert comma to dot
    )
    df["Stipendio"] = pd.to_numeric(df["Stipendio"], errors="coerce")

# -- "Data_nascita" column
if "Data_nascita" in df.columns:
    df["Data_nascita"] = pd.to_datetime(df["Data_nascita"], dayfirst=True, errors="coerce")


In [51]:
df.head()  

Unnamed: 0,ID,Nome,Età,Data_nascita,Stipendio,Note
0,8,Laura Verdi,25.0,1999-06-21,3000.0,
1,Unnamed: 0,Extra colonna inutile,,NaT,,


In [52]:
# =========================
# 3.3 FINAL RESULT
# =========================
print("✅ File caricato e pulito correttamente!\n")
display(df)
print("\nTipi di dato:\n", df.dtypes)

✅ File caricato e pulito correttamente!



Unnamed: 0,ID,Nome,Età,Data_nascita,Stipendio,Note
0,8,Laura Verdi,25.0,1999-06-21,3000.0,
1,Unnamed: 0,Extra colonna inutile,,NaT,,



Tipi di dato:
 ID                      object
Nome                    object
Età                    float64
Data_nascita    datetime64[ns]
Stipendio              float64
Note                   float64
dtype: object
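One caveat about the `Stipendio` cleaning in step 3.2: removing the dots and then turning the comma into a dot assumes the European style, so a US-style value such as `"2,800.00"` (row 3 of the raw file) would become `2.8`. The sketch below shows one possible way to handle both styles. The helper name `parse_amount` is hypothetical, and the heuristic assumes that every value carries a decimal part (a bare `1.234` remains ambiguous).

```python
def parse_amount(text):
    """Convert '2.500,50' or '2,800.00' to float; return None if it is not a number."""
    cleaned = str(text).strip().strip('"')
    last_dot = cleaned.rfind(".")
    last_comma = cleaned.rfind(",")
    if last_comma > last_dot:
        # European style: the last separator is a comma, so dots are thousands marks
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:
        # US style (or no comma at all): commas are thousands marks
        cleaned = cleaned.replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


for raw in ["2.500,50", "2,800.00", "3.000,00", "N/A"]:
    print(raw, "->", parse_amount(raw))
```

On a dataframe the helper could be applied with `df["Stipendio"].map(parse_amount)`, at the cost of a Python-level call per value.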


# Speeding up loading a large CSV file in pandas

How can we speed up loading a very large CSV file in pandas with `read_csv`?<br>
Here is the **recommended workflow**:

```python
    import pandas as pd

    # First read, with basic optimizations
    df_iter = pd.read_csv(
        "bigdata.csv",
        usecols=["A", "B", "C"],               # read only the columns we need
        dtype={"A": "int32", "B": "float32"},  # avoid type inference
        chunksize=1_000_000,                   # returns an iterator of DataFrames
        engine="c"                             # engine="pyarrow" does not support chunksize
    )

    # Incremental processing (here the chunks are simply concatenated)
    df = pd.concat(df_iter)

    # Save in optimized format (requires pyarrow or fastparquet)
    df.to_parquet("bigdata.parquet")
```

👉 Subsequent reads from Parquet or Feather can be **up to 50× faster**.

**Small practical tricks**

* pre-load the files into RAM (e.g. `cat file.csv > /dev/null` on Linux) if the bottleneck is the disk.
* if you often work with the same data → convert to Parquet right away.
* if the file is remote → use `storage_options` (e.g. S3 or GDrive) for direct reading.
* if you don’t need the index → `index_col=False` or `index_col=None`.
* to measure the effect: use `%%time` in Jupyter or VS Code, or `timeit`.


**Here is a practical guide to speed things up** 👇

How to speed up `pandas.read_csv()` on large files

1️⃣ **Specify data types (dtype)**<br>
If you don’t declare them, pandas has to “guess” types by scanning rows → slow and memory-hungry.

```python
    dtypes = {
        "id": "int32",
        "categoria": "category",
        "prezzo": "float32",
        "quantita": "int16"
    }
    df = pd.read_csv("file.csv", dtype=dtypes)
```

✅ Advantages: faster loading and a lighter dataframe.
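The effect on memory can be checked directly with `memory_usage(deep=True)`. A minimal sketch on synthetic data (the generated rows are invented, the column names follow the example above):

```python
import io
import pandas as pd

# Synthetic data, only to compare memory footprints
rows = "\n".join(
    f"{i},{'A' if i % 2 else 'B'},{i * 0.5},{i % 100}" for i in range(10_000)
)
csv_text = "id,categoria,prezzo,quantita\n" + rows

dtypes = {
    "id": "int32",
    "categoria": "category",
    "prezzo": "float32",
    "quantita": "int16",
}

df_default = pd.read_csv(io.StringIO(csv_text))             # inferred types
df_typed = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)  # declared types

mem_default = df_default.memory_usage(deep=True).sum()
mem_typed = df_typed.memory_usage(deep=True).sum()
print(f"inferred dtypes: {mem_default:,} bytes")
print(f"declared dtypes: {mem_typed:,} bytes")
```

Narrower types such as `float32` lose precision compared with `float64`, so they are a trade-off to evaluate column by column.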

2️⃣ **Read only some columns**<br>
If you don’t need them all, declare `usecols`:

```python
    df = pd.read_csv("file.csv", usecols=["id", "prezzo", "quantita"])
```

✅ You save time and memory.

3️⃣ **Disable what you don’t need**

No index:

```python
    index_col=False
```

No complex missing-number detection:

```python
    keep_default_na=False
    na_values=[""]
```

No automatic date conversion:

```python
    parse_dates=False
```

✅ All of this avoids expensive inference.

4️⃣ **Use chunking (block reading)**

If the file is too large for RAM, read it in pieces:

```python
    chunks = pd.read_csv("file.csv", chunksize=1_000_000)
    for chunk in chunks:
        # process the chunk
        process(chunk)
```

✅ You keep memory usage low and can process in streaming.
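The `process(chunk)` call above is a placeholder. As one possible instance, the sketch below accumulates a per-category total across chunks without ever holding the whole file in memory; the data and column names are illustrative, and the tiny `chunksize` is only there to force several chunks.

```python
import io
from collections import Counter

import pandas as pd

csv_text = "categoria,quantita\nA,1\nB,2\nA,3\nB,4\nA,5\n"

totals = Counter()  # running totals per category
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    partial = chunk.groupby("categoria")["quantita"].sum()
    totals.update(partial.to_dict())   # Counter.update adds the partial sums

print(dict(totals))
```

Only additive statistics (sums, counts, min/max) combine this simply; quantities such as the median need a different strategy.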

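5️⃣ **Specify the engine**

pandas can use three engines:

* `engine='c'` (default, written in C) → fast and complete
* `engine='python'` → more flexible but slower
* `engine='pyarrow'` (pandas ≥ 1.4) → multithreaded and often the fastest on large files, but it does not support every option (e.g. `chunksize`)

Unless you need a feature of the other engines, make sure you use: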

```python
    pd.read_csv("file.csv", engine="c")
```

6️⃣ **“Warm up” the disk cache**

On Linux:

```bash
    cat file.csv > /dev/null
```

→ this way the file is already in the RAM cache and the next `read_csv` will be faster
(this doesn’t improve the first read, but it does speed up the following ones).

7️⃣ **Convert to Parquet as soon as you can**

CSV → Parquet once, then always work in Parquet:

```python
    df = pd.read_csv("file.csv")
    df.to_parquet("file.parquet")
```

and later:

```python
    df = pd.read_parquet("file.parquet")
```

✅ Often 5–10× faster to read and 3–4× less disk space.

8️⃣ **Alternative: use Dask or Polars**

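If the file is huge (tens of GB):

```python
    import dask.dataframe as dd
    ddf = dd.read_csv("file.csv")      # lazy: nothing is loaded until .compute()
```

→ parallel reading on multiple cores;

```python
    import polars as pl
    df_pl = pl.read_csv("file.csv")
```

→ super-fast Rust engine (even 10× faster than pandas).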

9️⃣ **Always measure with `%%time`**

In Jupyter or VS Code:

```python
    %%time
    df = pd.read_csv("file.csv", dtype=dtypes, usecols=cols)
```

Compare various versions and pick the fastest one in your context.
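Outside a notebook, where `%%time` is not available, the same comparison can be done with `time.perf_counter`. A minimal sketch on an in-memory file (the helper name `time_read` and the synthetic data are illustrative; on such a small input the timings are dominated by noise, so real measurements should use the actual file):

```python
import io
import time

import pandas as pd

# Synthetic CSV, only to make the example self-contained
csv_text = "id,prezzo,quantita,nota\n" + "\n".join(
    f"{i},{i * 1.5},{i % 7},riga {i}" for i in range(50_000)
)

def time_read(label, **read_csv_kwargs):
    """Time one pd.read_csv call and return (seconds, dataframe)."""
    start = time.perf_counter()
    df = pd.read_csv(io.StringIO(csv_text), **read_csv_kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.4f} s, {df.shape[1]} columns")
    return elapsed, df

t_all, df_all = time_read("all columns")
t_few, df_few = time_read(
    "usecols + dtype",
    usecols=["id", "prezzo"],
    dtype={"id": "int32", "prezzo": "float32"},
)
```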


# Application to the financial file `FinancialIndicators`

The *Credit_ISLR* file is very small. Let’s use the larger CSV file *FinancialIndicators.csv*:

* about 7000 rows
* 73 columns
* about 2.4 MB
* separator = ',' (US file)


In [53]:
import time
start_time = time.time()

df_FI = pd.read_csv('FinancialIndicators.csv')

end_time = time.time()

print ('Execution Total Time: ', end_time - start_time)

df_FI.head()

Execution Total Time:  0.0381169319152832


Unnamed: 0,Company Name,Industry Name,SIC,Exchange,Country,Stock Price,% Chg in last year,Trading Volume,# of shares outstanding,Market Cap,...,Trailing Net Income,Dividends,Intangible Assets/Total Assets,Fixed Assets/Total Assets,Market D/E,Market Debt to Capital,Book Debt to Capital,Dividend Yield,Insider Holdings,Institutional Holdings
0,@Road Inc,Telecom. Services,4810,NDQ,US,5.23,-0.02,236397,54.8,319.6,...,27.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,,0.23
1,1-800 Contacts Inc,Medical Supplies,8060,NDQ,US,11.7,0.03,57921,13.3,151.9,...,3.3,0.0,0.48,0.19,0.16,0.14,0.29,0.0,,0.39
2,1-800-ATTORNEY Inc,Publishing,2700,NDQ,US,1.01,0.0,1438,0.0,0.0,...,-1.0,0.0,,,,,,0.0,,0.0
3,1-800-FLOWERS.COM,Internet,7370,NDQ,US,6.42,-0.01,197850,65.2,422.9,...,7.8,0.0,0.31,0.2,0.01,0.01,0.03,0.0,0.21,0.88
4,1mage Software Inc,Computer Software/Svcs,3579,NDQ,US,0.01,0.0,10200,3.3,0.03,...,-0.8,0.0,0.0,0.0,12.12,0.92,,0.0,,0.0


Let’s apply the arguments listed above to the loading of this file.

# Performance

As for the **performance of the various formats** (in terms of memory usage, saving to disk, and opening/reading) see the following useful study.

The key message of the study is that:

* the CSV format is much better than Excel (not even taken into consideration in the comparison), and it is available in all *data management* environments
* for big data (as we will see) the best format is Parquet, especially in terms of memory usage.


In [None]:
# Usage example:
show_pdf("I_O Optimization in Data Projects - by Avi Chawla.pdf")   # see the last chapter on reading PDFs in VS Code

# Data format for big data

Is it possible to load big data with 5M rows in *pandas*? It depends.

The short answer is: yes, pandas can handle 5 million rows, but it depends on what you mean by “handle” and on how much RAM you have available.


In [54]:
import pandas as pd
import glob
import os

In [55]:
# r prefix tells Python NOT to interpret \ as escape
path = r'C:\Users\Utente\Desktop\salvataggi\SALVATAGGIO DATI\Documents\Seminari\Data Science (corsi)\Corso Python base\linkage\file_csv'

all_files = glob.glob(os.path.join(path, "*.csv"))

In [56]:
li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)

In [57]:
frame.shape

(5749132, 12)

In [58]:
frame.head()

Unnamed: 0,id_1,id_2,cmp_fname_c1,cmp_fname_c2,cmp_lname_c1,cmp_lname_c2,cmp_sex,cmp_bd,cmp_bm,cmp_by,cmp_plz,is_match
0,37291,53113,0.833333333333333,?,1.0,?,1,1,1,1,0,True
1,39086,47614,1.0,?,1.0,?,1,1,1,1,1,True
2,70031,70237,1.0,?,1.0,?,1,1,1,1,1,True
3,84795,97439,1.0,?,1.0,?,1,1,1,1,1,True
4,36950,42116,1.0,?,1.0,1,1,1,1,1,1,True


In [59]:
frame.tail()

Unnamed: 0,id_1,id_2,cmp_fname_c1,cmp_fname_c2,cmp_lname_c1,cmp_lname_c2,cmp_sex,cmp_bd,cmp_bm,cmp_by,cmp_plz,is_match
5749127,47892,98941,1,?,0.166667,?,1,0,0,1,0,False
5749128,53346,74894,1,?,0.222222,?,1,0,0,1,0,False
5749129,18058,99971,0,?,1.0,?,1,0,0,0,0,False
5749130,84934,95688,1,?,0.0,?,1,0,1,0,0,False
5749131,20985,57829,1,1,0.0,?,1,0,1,1,0,False


---

⚙️ **1. It depends on the total size in memory**

Pandas works entirely in RAM.

Example:

* 5 million rows × 50 columns
* each cell takes ~8 bytes (`float64`)<br>
  👉 $5{,}000{,}000 \times 50 \times 8 \approx 2 \text{ GB}$

So a 200 MB CSV file can become **2–3 GB in RAM** once loaded, because of conversion to numeric types, indexes, metadata, etc.

If you have a PC with **16 GB of RAM**, it’s fine; if you have 8 GB, pandas can do it but it will be slow and you may see “swap” or crashes due to lack of memory.
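The back-of-the-envelope estimate above can be reproduced in a couple of lines. The helper name `estimate_ram_gb` is hypothetical, and the calculation covers only fixed-width numeric cells: string columns usually take considerably more.

```python
def estimate_ram_gb(n_rows, n_cols, bytes_per_cell=8):
    """Rough in-memory size in GB (10**9 bytes) for fixed-width numeric data."""
    return n_rows * n_cols * bytes_per_cell / 1e9


estimate = estimate_ram_gb(5_000_000, 50)             # float64 everywhere
estimate_f32 = estimate_ram_gb(5_000_000, 50, bytes_per_cell=4)

print(f"float64: {estimate:.1f} GB")
print(f"float32: {estimate_f32:.1f} GB")
```

After loading, `df.memory_usage(deep=True).sum()` gives the actual figure for comparison.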

---


🧠 **2. Operations that pandas handles well even with 5M rows**

With “decent” hardware (modern CPU, 16 GB RAM) pandas comfortably handles:

✅ **CSV loading**

```python
    df = pd.read_csv("dati.csv")
```

You can also use:

* `dtype=` to better type the columns (less RAM);
* `usecols=` to read only some columns;
* `chunksize=` to read in blocks.

✅ **Elementary operations and aggregations**

* `df.describe()`, `df.mean()`, `df.groupby("col").agg(...)`
* `df.sort_values("col")`
* `df.query("x > 10 and y < 5")`
* `df.sample(100_000)`<br>
  all doable.

✅ **Moderate joins and merges**<br>
Up to a few million rows per table:

```python
    pd.merge(df1, df2, on="id", how="inner")
```

works, but watch out for memory spikes.


---

🚫 **Operations that start to become problematic**

When the dataset exceeds **5–10 million rows or goes over 5 GB in RAM**, here’s what slows down or blows up:

❌ **multiple sorts or complex sorts**

```python
    df.sort_values(["col1", "col2"])
```

This creates a copy in memory as large as the DataFrame itself.

❌ **very large merge / join**<br>
if the two tables together exceed the available RAM.

❌ **row-by-row apply / lambda**

```python
    df.apply(lambda row: f(row.x), axis=1)
```

Very slow: the function is executed in pure Python, once per row, not in C.<br>
Better to use **vectorized** functions (`np.where`, `pd.Series.map`, etc.).
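A minimal sketch of the difference, on a tiny illustrative dataframe (the threshold and the labels are invented): both versions produce the same column, but the second one works on the whole column at once.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"x": [5, 12, 8, 20]})

# Row-by-row version: one Python function call per row
slow = df_demo.apply(lambda row: "alto" if row.x > 10 else "basso", axis=1)

# Vectorized version: a single operation on the whole column
fast = pd.Series(
    np.where(df_demo["x"] > 10, "alto", "basso"),
    index=df_demo.index,
)

print(slow.tolist())
print(fast.tolist())
```

On a few rows the timing difference is invisible; it becomes relevant on millions of rows, where it can be measured with the timing techniques shown earlier.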

❌ **iterative operations**<br>
Loops like `for row in df.itertuples()` on millions of rows → a disaster!

❌ **Writing to CSV**

```python
    df.to_csv("file.csv")
```

Not very efficient: every value has to be formatted as text. Writing to Parquet is usually faster and produces smaller files.


---

⚡ **Alternatives and strategies**

**1. Use *chunking***

The following code is **risky**, because it loads the whole file into RAM at once:

```python
import pandas as pd

df = pd.read_csv("dati.csv")
df["media"] = df["valore"].mean()
```

Better to read in blocks and process iteratively:

```python
import pandas as pd

chunksize = 500_000   # reads 500 thousand rows at a time
somma = 0.0           # running sum of the non-missing values
conteggio = 0         # running count of the non-missing values

for chunk in pd.read_csv("dati.csv", chunksize=chunksize):
    somma += chunk["valore"].sum()         # partial sum on the part just read
    conteggio += chunk["valore"].count()   # count() ignores NaN, like mean()

# a mean of the per-chunk means would be wrong when chunks differ in size
# (the last one usually does), so we combine partial sums and counts instead
media_totale = somma / conteggio
print("Media complessiva:", media_totale)
```

**2. Use the *parquet* data format**<br>
See the next chapter.

**3. Use `cuDF`**<br>
It uses the GPU without requiring changes to the pandas code.

**4. Use `Spark`**<br>


# A very, very large CSV file

Can we load a CSV file like [this one](https://www.kaggle.com/datasets/aadimator/nyc-realtime-traffic-speed-data/data)? It’s almost 30 GB.<br>
*Download* --> *Download dataset as zip (10 GBs)*.<br>
Its name is **DOT_Traffic_Speeds_NBE.csv** and it refers to traffic in New York City.

Let’s see what it contains:


---
That Kaggle dataset is an **export** of NYC DOT’s **real-time traffic speed feed** (“DOT Traffic Speeds NBE”).

Each row is a **timestamped observation for one road segment (a “link”)** with the average **speed** and **travel time** between the segment’s start and end points. It’s maintained by NYC DOT and mirrored to Kaggle. ([Kaggle][1])

Here’s what the **fields mean** (names may appear in UPPER_CASE on Kaggle):

* **ID / LINK_ID**
  Unique identifier of the road **segment** (link) from TRANSCOM (regional traffic consortium). `LINK_ID` is the same as `ID`. Use this as **the key to group or join**. ([Official Website of New York City][2])

* **SPEED**
  **Average speed (mph)** vehicles traveled **across the whole segment** during the most recent interval. It’s not the spot speed at a point: think “segment travel speed.” Expect missing or zero values at times. ([Official Website of New York City][2])

* **TRAVEL_TIME**
  **Seconds** the average vehicle took to traverse the segment in that interval. Roughly `TRAVEL_TIME ≈ segment_length / SPEED` (after converting units). Useful to derive segment length if you have a stable speed sample. ([Official Website of New York City][2])

* **STATUS**
  Marked as an **artifact / not useful** in NYC DOT’s own metadata. Most people ignore it. ([Official Website of New York City][2])

* **DATA_AS_OF** (a.k.a. `DataAsOf`)
  **Timestamp** when data for that link was last received. The feed updates **every few minutes**. Timezone is local (Eastern). Use this for time-series work and resampling. ([Official Website of New York City][2])

* **LINK_POINTS**
  **Plaintext sequence of lat/long pairs** describing the link geometry (start→end polyline). **Caveat:** some values are **truncated**, so don’t rely on this alone for precise mapping. ([Medium][3])

* **ENCODED_POLY_LINE**
  **Google-encoded polyline** version of the same geometry. This is usually the better field to decode for maps. (See Google’s polyline spec referenced by DOT.) ([Official Website of New York City][2])

* **ENCODED_POLY_LINE_LVLS**
  **Polyline “levels”** for Google’s legacy rendering (zoom levels). Often unused in modern tooling but included for completeness. ([Official Website of New York City][2])

* **OWNER**
  Owner of the detector producing this link’s data (administrative/operational). ([Official Website of New York City][2])

* **TRANSCOM_ID / TRANSCOM_ID (artifact)**
  Marked **not useful** by the publisher (redundant with ID). ([Official Website of New York City][2])

* **BOROUGH**
  NYC borough name (**Brooklyn, Bronx, Manhattan, Queens, Staten Island**). It can be blank for some links. Handy for rollups and filtering. ([Official Website of New York City][2])

* **LINK_NAME / DESCRIPTION**
  Human-readable description of the segment (e.g., “BQE N Atlantic Ave - BKN Bridge Manhattan Side”). Note: **links are one-way**, and not every corridor has both directions in the feed. ([Medium][3])

### How to interpret the dataset (what a “row” is)

* One **segment (link)** × one **timestamp** → **avg speed & travel time** for vehicles that **completed** that segment in the interval. It’s not per-vehicle data; it’s an **aggregate**. ([Medium][3])
* The feed is **real-time / near-real-time**, updated several times per minute, and covers **major arterials & highways** in NYC. ([Official Website of New York City][2])

### Practical notes / gotchas

* **Geometry:** Prefer **`ENCODED_POLY_LINE`** over `LINK_POINTS`; the latter can be cut off. ([Medium][3])
* **Aggregation grain:** Links are **directional**; do not assume two-way coverage for a corridor. ([Medium][3])
* **Units:** SPEED = mph, TRAVEL_TIME = seconds; `BOROUGH` is a label, not a geometry. ([Official Website of New York City][2])
* **Quality:** Occasional zeros/missing values; treat **STATUS** as ignorable. ([Official Website of New York City][2])

### Typical uses

* Compute **p50/p90 speeds** by `BOROUGH`/`LINK_ID`/hour; detect slowdowns and incidents.
* Map segments by decoding **`ENCODED_POLY_LINE`**; join with borough boundaries for choropleths.
* Derive **segment length** via `median(SPEED)*median(TRAVEL_TIME)` (unit-converted) if length isn’t separately available.
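The unit conversion behind the last point can be written out explicitly. A minimal sketch, assuming SPEED is in mph and TRAVEL_TIME in seconds as stated above; the function name and the sample values are illustrative, not taken from the dataset:

```python
def segment_length_miles(speed_mph, travel_time_s):
    """Length covered at speed_mph during travel_time_s seconds, in miles."""
    hours = travel_time_s / 3600       # seconds -> hours
    return speed_mph * hours


# Example: 30 mph sustained for 120 seconds corresponds to 1 mile
length = segment_length_miles(30, 120)
print(f"{length:.2f} miles")
```

In practice the inputs would be the per-link medians of the two columns, which are less sensitive to the occasional zero or missing reading.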

[1]: https://www.kaggle.com/datasets/aadimator/nyc-realtime-traffic-speed-data?utm_source=chatgpt.com "NYC Real-Time Traffic Speed Data"
[2]: https://www.nyc.gov/html/dot/downloads/pdf/metadata-trafficspeeds.pdf?utm_source=chatgpt.com "Traffic Sensors Metadata What does this data set describe? ..."
[3]: https://medium.com/qri-io/new-qri-dataset-s-nyc-real-time-traffic-speeds-c3e4c88f44be "New Qri Dataset(s): NYC Real-Time Traffic Speeds | by Chris Whong | qri.io | Medium"

---


**NOTES on this size (about 30 GB)**

28 GB ≈ 28,000,000,000 bytes ≈ 26 GiB (if we look at it in “computer” terms).

A CSV file is textual, so it’s not dense: the same data in **Parquet** would often fit in **3–6 GB**.

In RAM this file, <u>if read with *pandas*</u>, **takes up much more than 28 GB**: pandas in fact has to:

* read the text,
* parse it,
* create the internal arrays.

Result: 28 GB of CSV read with *pandas* can become **50–80 GB of RAM** without trying too hard (it depends on how many string columns there are, how many NaNs, how long the labels are, etc.).

**Is such a size common in a company?**

* a single 28 GB CSV is not the norm in ERP/accounting/HR. In these systems we more often find **50–500 MB, at most 2–3 GB** when they export “everything”.
* however it is totally normal in contexts like:

  * application / web / security logs,
  * telco,
  * mobility / transportation (like our case),
  * IoT,
  * data lakes “dumped” from a legacy system.

But… companies almost never want to have a single 28 GB CSV. Usually it’s a “big dump” made like that because “that was the export option”, or because someone did `SELECT *` over 3 years and sent it to S3. In serious production you split by date or by partition and you go with Parquet.

**So: it’s not strange to have 28 GB of data. It’s a bit strange to have them all in a single CSV.**


## Determination of the execution environment

The notebook works **the same way** on Jupyter Notebook / Visual Studio Code and on Google Colab, as said, <u>except for two aspects</u>:

* loading the datasets into the notebook
* including the *png* images in the individual cells

It is therefore useful to **determine the execution environment**, setting a binary variable (to `True` if we are in Google Colab, to `False` if we are in Jupyter Notebook).

The two operations above will be performed differently depending on the value of the binary variable.


In [None]:
# setting the BINARY TOGGLE:
try:
    import google.colab                      # package available ONLY in Google Colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

print("Running on Colab:", IN_COLAB)


# IMPORT of the necessary packages (needed both in JN and in Colab):
from IPython.display import Image, display   # import of embedding and image display packages (one time)
                                             # Image and display are both needed in Jupyter Notebook
                                             # Google Colab uses only Image
import os                                    # needed in Google Colab to see from a code cell
                                             # the contents of 'content'


## Loading the big *csv* with Google Colab Pro

How much space do we have in the VM *session storage*? (with an **L4-GPU** runtime with 53GB of RAM, 22.5 GB of VRAM and 235.7 GB of disk)


In [None]:
if IN_COLAB:
    !df -h

`overlay 236G 40G 197G 17% /`: this is **the root** of the Colab environment, i.e. the filesystem that contains `/content`.<br>
What really matters is **Avail = 197G**.<br>

In other words: we can create new files up to about 197 GB (then of course it also depends on how much is used for notebooks, Parquet files, etc.).


What about the other lines (`/dev/root`, `tmpfs`, `/dev/shm`, …) from the previous output?
They belong to the Colab container.

* `/dev/shm 26G` → this is the shared memory (useful for multiprocessing, for example, but not to store 28 GB).
* `tmpfs 27G` → temporary filesystems kept in RAM.
* they are not the right place to park a 28 GB CSV!


So we have a lot of local space in the VM: **about 197 GB free** → so yes, a 28 GB file fits easily.
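The same check can also be done from Python, which is handy outside Colab where the `!df -h` shell escape may not be available. A minimal sketch using the standard library; the helper name `free_space_gb` is hypothetical and the 28 GB figure is the file size discussed above:

```python
import shutil

def free_space_gb(path="/"):
    """Free space, in GB (10**9 bytes), on the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)    # named tuple: total, used, free (bytes)
    return usage.free / 1e9


csv_size_gb = 28                       # size of the CSV we want to store
free_gb = free_space_gb("/")
fits = free_gb > csv_size_gb

print(f"Free space: {free_gb:.1f} GB -> the file fits: {fits}")
```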

However, uploading such a large file to Google Colab *session storage* is slow and at risk of failure. Much better to put the file on Google Drive and then **mount the disk** (authorizing the Google connection with the usual steps):


In [None]:
if IN_COLAB:
    from google.colab import drive
    drive.mount("/content/drive", force_remount=True)  # force_remount=True lets us mount again if Drive is already mounted

This way the file stays on Drive, we don’t have to “upload” it into the session, we read it from there (`/content/drive/MyDrive/.../big.csv`), and Colab doesn’t have to keep 28 GB on the local disk.

⚠️ Warning: reading 28 GB from Drive is slower than reading from local disk. For a huge CSV this can mean **minutes of I/O**.

NB. `drive/MyDrive` is now **also available under `content` in the session storage**.

In fact we can see it again by rerunning the `!df -h` command:


In [None]:
if IN_COLAB:
    !df -h

<u>Question</u>: but do “overlay” and “drive” have the same size (236G) 🤔?

Yes, it looks like that because Colab often **shows the same backing storage or in any case two volumes with a similar size**. What matters to us is: we have ~200 GB free locally and ~187 GB free on Drive → both > 28 GB → we’re safe.


First of all, as a check, let’s **list the files on the drive**:

In [None]:
if IN_COLAB:
    import os
    base = "/content/drive/MyDrive"

    for name in os.listdir(base):
        print(name)

If we also want to see the **size** of the files:

In [None]:
if IN_COLAB:
    for name in os.listdir(base):
        path = os.path.join(base, name)
        if os.path.isfile(path):
            print("FILE ", name, os.path.getsize(path))
        else:
            print("DIR  ", name)
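
`os.path.getsize` returns raw bytes, which are hard to read for multi-gigabyte files. A small formatting helper (hypothetical name `human_size`, using binary units) could make the listing easier to scan:

```python
def human_size(num_bytes):
    """Format a byte count with binary units (1 KiB = 1024 B)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TiB"   # anything larger is reported in TiB

for n in (512, 1536, 28 * 1024**3):
    print(n, "->", human_size(n))
```

In the loop above, the size could then be printed as `human_size(os.path.getsize(path))`.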

Let’s check the size of the `DOT_Traffic_Speeds_NBE.csv` file:


In [None]:
if IN_COLAB:
    !ls -lh /content/drive/MyDrive/DOT_Traffic_Speeds_NBE.csv

Now we are ready to read the CSV file with pandas (`pd.read_csv`), accelerated by **cuDF**.

---

**`cudf` is a CUDA-accelerated version of pandas.**

See [this post with video](https://www.linkedin.com/feed/update/urn:li:activity:7173982894921519105?utm_source=share&utm_medium=member_desktop) and [this article](https://www.blog.dailydoseofds.com/p/nvidias-latest-update-can-make-your).

*Steve Nouri*:<br>
**NVIDIA made Pandas 50x faster with No code change!**<br>

You simply need to do this:<br>

```python
    %load_ext cudf.pandas
    import pandas as pd
```

Now `cuDF` will be **integrated directly into Google Colab** (you obviously need to have a **GPU-enabled runtime**).

Also take a look here [https://bit.ly/3XX9pgm](https://bit.ly/3XX9pgm).

See also [this excellent video](https://www.youtube.com/watch?v=8X_IaCNpo7E) translated into Italian.

---


In [None]:
if IN_COLAB:
    %load_ext cudf.pandas
    import pandas as pd
    import cudf
else:
    import pandas as pd

The following **reading** loop takes **610 seconds** on L4-GPU.<br>
**The general idea**:

* we have a huge CSV that we read with *pandas* **in blocks of 250k rows**,
* the blocks arrive one at a time (`chunksize`),
* we move each piece to the GPU with cuDF,
* there (on the GPU) we could filter the rows or transform the data,
* if we want, we save each piece right away or accumulate it in a list,
* at the end we can concatenate everything on the GPU (only if it fits in VRAM),
* the whole code block below is timed to know how long it takes.


In [None]:
import time

# timer start-up
start_time = time.time()

# // code to be measured

# 1. file path
CSV_PATH = "/content/drive/MyDrive/DOT_Traffic_Speeds_NBE.csv"

# 2. chunk size (start low)
ROWS_PER_CHUNK = 250_000   # ~200-300 MB depending on the columns

# 3. if you want to accumulate the chunks in GPU (only if we can keep them)
gdf_parts = []

for i, chunk in enumerate(pd.read_csv(CSV_PATH, chunksize=ROWS_PER_CHUNK)):
    print(f"[pandas] read chunk {i} with {len(chunk)} rows")

    # 4. convert the pandas chunk -> cuDF (here we use the GPU)
    gdf_chunk = cudf.from_pandas(chunk)
    print(f"[cuDF] chunk {i} on GPU with shape {gdf_chunk.shape}")

    # ⬇️ here we can do our operations on the GPU
    # example: filter
    # gdf_chunk = gdf_chunk[gdf_chunk["BOROUGH"] == "MANHATTAN"]

    # example: we save immediately to parquet so as not to keep everything in GPU
    # gdf_chunk.to_parquet(f"/content/out_part_{i:04d}.parquet")

    # if instead we keep them to merge later:
    gdf_parts.append(gdf_chunk)

# 5. (optional) we merge all the GPU pieces into a single cuDF DataFrame
# ⚠️ do it only if it fits in VRAM
if gdf_parts:
    gdf_all = cudf.concat(gdf_parts, ignore_index=True)
    print(gdf_all.shape)

# // end of code to be measured

# end timer and print
end_time = time.time()
print ('Execution Total Time: ', end_time - start_time)


Let’s go line by line to see what the code in the previous cell does.

**Timer**

```python
import time
start_time = time.time()
```

where:

* we import the `time` package
* we store the start instant
* at the end we store the end instant to say “this whole run took X seconds”.

**Reading parameters**

```python
    CSV_PATH = "/content/drive/MyDrive/DOT_Traffic_Speeds_NBE.csv"
    ROWS_PER_CHUNK = 250_000
    gdf_parts = []
```

where:

* `CSV_PATH`: where the file is
* `ROWS_PER_CHUNK = 250_000`: instead of reading 65 million rows in one shot, we read **250k at a time**. <u>This is the right way when the CSV is huge</u>.
* `gdf_parts = []`: the empty list where we put the cuDF DataFrames as we convert them.

**Chunked reading with pandas**

```python
    for i, chunk in enumerate(pd.read_csv(CSV_PATH, chunksize=ROWS_PER_CHUNK)):
        print(f"[pandas] read chunk {i} with {len(chunk)} rows")
```

Here something important happens:

* we are not reading directly with cuDF.
* we are using *pandas* with the option `chunksize=...` → this makes `read_csv` return an **iterator** (a `TextFileReader`) instead of a DataFrame: each pass of the `for` loop yields **only a piece of the file**.
* `i` is the index of the `chunk` (0, 1, 2, …).
* `chunk` is a **pandas DataFrame** with ~250k rows.
* we print how many rows we have read to see that we are progressing.

Why do we do it like this? Because pandas is often more “tolerant” of messy files and has chunked reading built in; we then use cuDF only for the processing.

> <u>Note on `enumerate`</u><br>
> `enumerate` is a Python function that takes something **iterable** (list, generator, in this case the chunks of `read_csv`) and returns pairs:
>
> * the loop number (0, 1, 2, 3…)
> * the actual element of the iteration<br>
>
> That is, with `enumerate(...)` we are saying: “for each piece you give us, also give us the index of the piece”.
>
> * `i` → 0 for the first chunk, 1 for the second, 2 for the third…
> * `chunk` → **the pandas DataFrame with those 250,000 rows**<br>
>
> This way we can print:
>
> ```python
>     print(f"[pandas] read chunk {i} ...")
> ```
>
> and we know where we are.
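
The behaviour of `enumerate` can be checked on a stand-in generator, with no file involved. In this sketch `fake_chunks` is a hypothetical helper that yields small lists in place of DataFrames:

```python
def fake_chunks():
    """Stand-in for pd.read_csv(..., chunksize=3): yields 'chunks' one at a time."""
    yield ["row0", "row1", "row2"]
    yield ["row3", "row4", "row5"]
    yield ["row6"]                      # the last chunk is usually shorter

seen = []
for i, chunk in enumerate(fake_chunks()):
    seen.append((i, len(chunk)))        # (chunk index, number of rows)
    print(f"chunk {i} with {len(chunk)} rows")
```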

**pandas → cuDF conversion**

```python
    gdf_chunk = cudf.from_pandas(chunk)
    print(f"[cuDF] chunk {i} on GPU with shape {gdf_chunk.shape}")
```

* we take the piece read in RAM (pandas),
* we move it to the GPU, converting it to a cuDF DataFrame,
* we print the `shape` (the dimensions) just for checking.

This is the point where we “use the GPU”.

**Point where to do the (OPTIONAL) processing**

```python
    gdf_chunk = gdf_chunk[gdf_chunk["BOROUGH"] == "MANHATTAN"]
    gdf_chunk.to_parquet(...)
```

It’s the pattern “I read big CSVs → I transform them in pieces → I save them in a more convenient format”.

**Accumulating the pieces**

```python
    gdf_parts.append(gdf_chunk)
```

Instead of saving right away, we put the piece in a list.<br>
This is handy if:

* we want to do a concatenation at the end,
* or if the chunks are few and fit in memory.

It is risky if the CSV is really huge and the GPU has little VRAM.

**Final concatenation (optional)**

```python
    if gdf_parts:
        gdf_all = cudf.concat(gdf_parts, ignore_index=True)
        print(gdf_all.shape)
```

If we have at least one piece, we merge them all into a single cuDF DataFrame.

We pass `ignore_index=True` so that the result gets a fresh 0…n-1 index instead of carrying over the per-chunk indexes (which would have gaps after filtering, for example).

⚠️ We rightly put the comment: “do it only if it fits in VRAM”.<br>
On huge datasets this step is often skipped, and we stop at saving per chunk.

**End timer**

```python
    end_time = time.time()
    print ('Execution Total Time: ', end_time - start_time)
```

We print how long the whole run took: chunked reading + conversion + optional concat.
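
When the same measurement is repeated around several cells, it may be convenient to wrap it in a context manager. This is only a sketch: `timed` is a hypothetical helper, and it uses `time.perf_counter` (a monotonic clock intended for measuring intervals) in place of `time.time`.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Measure the wall-clock time of the enclosed block and print it."""
    result = {}
    start = time.perf_counter()
    try:
        yield result                     # the caller can read result["seconds"] afterwards
    finally:
        result["seconds"] = time.perf_counter() - start
        print(f"{label}: {result['seconds']:.3f} s")

# Toy workload standing in for the chunked-reading loop
with timed("toy loop") as t:
    total = sum(range(1_000_000))
```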


The following version takes another approach: “don’t accumulate, process and then discard”. It is even safer (**770 seconds on L4-GPU**) 👇

In [None]:
import time
# timer start-up
start_time = time.time()

CSV_PATH = "/content/drive/MyDrive/DOT_Traffic_Speeds_NBE.csv"
ROWS_PER_CHUNK = 250_000

# creation of the directory on session storage
out_dir = "/content/parquet_parts"
os.makedirs(out_dir, exist_ok=True)

for i, chunk in enumerate(pd.read_csv(CSV_PATH, chunksize=ROWS_PER_CHUNK)):
    gdf = cudf.from_pandas(chunk)
    # do your operations here...
    # and then do NOT keep it in memory
    gdf.to_parquet(f"{out_dir}/part_{i:04d}.parquet")

# end timer and print
end_time = time.time()
print ('Total Execution Time: ', end_time - start_time)


`chunk` → is the **pandas dataframe** that comes from the CSV.

`gdf` → is the **cuDF dataframe** (the “real” one we work on in the GPU).

Do we want to use pandas? → we use `chunk`<br>
Do we want to use cuDF? → we use `gdf` (this is “the dataframe” we care about for the GPU)


We want to count the rows:

In [None]:
if IN_COLAB:
    !wc -l /content/drive/MyDrive/DOT_Traffic_Speeds_NBE.csv

64,914,524 rows!
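
With the line count known, the number of loop iterations follows from a ceiling division. A quick check, assuming the count reported by `wc -l` above includes exactly one header line (so the data rows are one fewer):

```python
import math

TOTAL_LINES = 64_914_524          # from `wc -l`, header line included
DATA_ROWS = TOTAL_LINES - 1       # assumption: exactly one header line

def n_chunks(rows, rows_per_chunk):
    """Number of chunks needed to cover `rows` rows."""
    return math.ceil(rows / rows_per_chunk)

for size in (250_000, 1_000_000):
    print(f"{size:>9,} rows per chunk -> {n_chunks(DATA_ROWS, size)} chunks")
```

This gives 260 iterations with 250k-row chunks and 65 with 1M-row chunks.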

Knowing the number of rows in the file, we can now pick a larger chunk size for the reading code (`ROWS_PER_CHUNK` = 1,000,000, i.e. about 65 iterations):

In [None]:
if IN_COLAB:
    path = "/content/drive/MyDrive/DOT_Traffic_Speeds_NBE.csv"

    ROWS_PER_CHUNK = 1_000_000   # 1 million: ~65 cycles

    total = 0
    for i, chunk in enumerate(pd.read_csv(path, chunksize=ROWS_PER_CHUNK)):
        gdf = cudf.from_pandas(chunk)
        total += len(gdf)
        print(f"chunk {i:03d} -> {len(gdf)} rows, total: {total}")

    print("✅ total read:", total)

In [None]:
chunk.shape # the LAST chunk in memory

In [None]:
gdf.shape # the LAST cuDF chunk in memory

First we read in chunks.<br>
Now let’s TRY to read the whole file **into a single dataframe in memory** (about 7 minutes on L4-GPU). There is the risk that **RAM blows up**, but here we have 53GB!


In [None]:
if IN_COLAB:
    dfs = []
    for chunk in pd.read_csv(path, chunksize=1_000_000):
        dfs.append(chunk)

    df = pd.concat(dfs, ignore_index=True)

In [None]:
df.shape

In [None]:
df.head()

If instead we wanted to read the first chunk again, we would do this:

In [None]:
```python
import pandas as pd

path = "/content/drive/MyDrive/DOT_Traffic_Speeds_NBE.csv"

reader = pd.read_csv(path, chunksize=1_000_000)  # creates the iterator
first_chunk = next(reader)                       # takes ONLY the first piece

first_chunk.head()
```
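
The iterator behaviour used here can be tried on a toy CSV held in memory, with no Drive or GPU needed. The column names `id` and `speed` and the 10 rows are made up for the illustration:

```python
import io
import pandas as pd

# Build a tiny CSV in memory: a header plus 10 data rows.
csv_text = "id,speed\n" + "\n".join(f"{i},{30 + i}" for i in range(10)) + "\n"

reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)  # an iterator, not a DataFrame
first_chunk = next(reader)                  # takes ONLY the first 4 rows
remaining = [len(c) for c in reader]        # the iterator continues from where it stopped

print(type(reader).__name__)
print(len(first_chunk), remaining)
```

Once a chunk has been consumed it is gone from the iterator: to read the first chunk again, a new `pd.read_csv(..., chunksize=...)` call is needed, as in the cell above.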


# The *parquet* format

We will use the `usa_stocks_30m.parquet` time series: an OHLCV series of 514 NASDAQ and NYSE stocks (from 1998 to 2024).

The dataset we will work with is a subset of the [**USA 514 Stocks Prices NASDAQ NYSE**](https://www.kaggle.com/datasets/olegshpagin/usa-stocks-prices-ohlcv/data) dataset, also available on [Kaggle](https://www.kaggle.com/datasets), made up of about **36 million** elements.

We download the dataset NOT from Kaggle but from NVIDIA's "Public Google Cloud Storage bucket", to get higher download speeds.

NVIDIA's "Public Google Cloud Storage bucket" is an online space where NVIDIA makes public files available (such as datasets, models, code examples) that anyone can download.
It is a bit like a big digital cabinet open to everyone, hosted on Google Cloud.

This download takes **about 60 seconds**:

In [60]:
# Download of the big time series
import urllib.request

file_path = "usa_stocks_30m.parquet"
url = "https://storage.googleapis.com/rapidsai/colab-data/usa_stocks_30m.parquet"

if not os.path.isfile(file_path):
    print(f"Downloading {file_path}...")
    urllib.request.urlretrieve(url, file_path)
    print("Download complete.")
else:
    print(f"{file_path} already present.")

usa_stocks_30m.parquet already present.


**The file `usa_stocks_30m.parquet`**<br>

The `usa_stocks_30m.parquet` file is a dataset provided by the RAPIDS (NVIDIA) team to give examples of analysis on big financial time series with GPU-accelerated libraries (like `cuDF`).

📌 Basically:

* It is a **Parquet format** file (columnar, compressed, very efficient for big data).
* It contains **US stock price** data (listed stocks) recorded at a **30-minute frequency**.
* It is meant for demos: time-series analysis, manipulation with pandas/cuDF, CPU vs GPU benchmarks.

📊 It typically includes:

* `ticker` → the stock symbol (e.g. AAPL, MSFT).
* `datetime` → the date/time of the observation (every 30 min).
* `open`, `high`, `low`, `close`, `volume` (OHLCV) → classic trading fields.

📐 Indicative sizes:

* About **36 million rows**,
* Size **~ 600–700 MB** in Parquet format,
* **If converted to CSV it would become much heavier (even several GB)**.


👉 The Parquet format **is much more efficient than CSV**:

* it is binary and compressed (takes up less space);
* it is columnar → pandas can read only the needed columns;
* it preserves data types (no inference every time).


In [61]:
df = pd.read_parquet("usa_stocks_30m.parquet")
df.head()

Unnamed: 0,datetime,open,high,low,close,volume,ticker
0,1999-11-18 17:00:00,45.56,50.0,45.5,46.0,9275000,A
1,1999-11-18 17:30:00,46.0,47.69,45.82,46.57,3200900,A
2,1999-11-18 18:00:00,46.56,46.63,41.0,41.0,3830500,A
3,1999-11-18 18:30:00,41.0,43.38,40.37,42.38,3688600,A
4,1999-11-18 19:00:00,42.31,42.44,41.56,41.69,1584300,A


In [62]:
df.shape

(36087094, 7)
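
The statement that a CSV export would weigh several GB can be sanity-checked with back-of-the-envelope arithmetic on the five rows printed by `df.head()`. This is a rough sketch: it assumes those rows are representative in length, which five rows of a single ticker cannot guarantee.

```python
# The five rows shown by df.head(), written as CSV lines (without the index column).
sample_rows = [
    "1999-11-18 17:00:00,45.56,50.0,45.5,46.0,9275000,A",
    "1999-11-18 17:30:00,46.0,47.69,45.82,46.57,3200900,A",
    "1999-11-18 18:00:00,46.56,46.63,41.0,41.0,3830500,A",
    "1999-11-18 18:30:00,41.0,43.38,40.37,42.38,3688600,A",
    "1999-11-18 19:00:00,42.31,42.44,41.56,41.69,1584300,A",
]
N_ROWS = 36_087_094   # from df.shape

# Average bytes per CSV line (+1 for the newline character)
avg_bytes = sum(len(row) + 1 for row in sample_rows) / len(sample_rows)
estimate_gb = N_ROWS * avg_bytes / 1e9

print(f"~{avg_bytes:.1f} bytes per row -> ~{estimate_gb:.1f} GB as CSV")
```

With these numbers the estimate lands just under 2 GB, roughly three times the Parquet size quoted above.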

# Pickle and Feather: little used?

* `Parquet`: default for large tabular data → columnar, compressed, schema, partitionable, cross-language (Spark, DuckDB, BigQuery, etc.).
* `CSV`: human/“universal” exchange, but heavy and slow.
* `Feather (Arrow IPC file)`: super-fast for temporary passes between Python/R or for local caching; fewer features (no partitioning, no append, little “schema evolution”).
* `Pickle`: Python only, unsafe to load if you don’t trust the source, fragile across versions; good for Python objects (sklearn models, lists), not for long-lived tabular “data”.

**Are `Feather` and `Pickle` “little used”?**

`Pickle`

* 🔒 Security: `pickle.load` can execute code → not recommended for shared files.
* 🧬 Low portability: Python-only and sometimes tied to versions/libraries.
* 📦 Tabular data: it’s neither columnar nor efficiently compressed; no predicate pushdown.
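
The security point is easy to demonstrate with a harmless payload. In the sketch below (the class name `Harmless` is made up), `__reduce__` tells pickle which callable to invoke when the object is rebuilt; here it is `print`, but a malicious file could name any importable function.

```python
import pickle

class Harmless:
    """Hypothetical class whose pickled form triggers a function call on load."""
    def __reduce__(self):
        # Instruction stored in the pickle: "to rebuild this object, call print(...)".
        return (print, ("this line was printed by pickle.loads",))

payload = pickle.dumps(Harmless())   # plain bytes, could come from any file
result = pickle.loads(payload)       # calls print(...) as a side effect
print("object returned by loads:", result)
```

Note that `loads` does not even return a `Harmless` instance: it returns whatever the callable returned (here `None`).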

`Feather`

* 🧩 Niche use: great for Python↔R/Arrow interop and “fast” cache, but…
* 🧱 Fewer features: no partitioning, no append, worse handling of evolving “data lakes”.
* 🌍 Ecosystems: Spark/DB/Cloud tools push Parquet as the de facto standard.

**When they make sense**:

`Pickle`

* Quick snapshots of Python objects (e.g. sklearn pipelines) for internal and controlled use.
* Often better alternatives: `joblib.dump` for sklearn models; ONNX / PMML for portability; for tabular data → Parquet.

`Feather`

* “Fast I/O” local cache (e.g. intermediate save in a notebook).
* Fast Python↔R exchange (Arrow): `pyarrow.feather` / `arrow::write_feather` in R.
* Any case where you want maximum read/write speed on a single file and you don’t need to partition.


# JSON files

A **non-tabular** but **frequently used** file format in Python is JSON.

JSON files (extension *.json*) are one of the most used formats today to exchange data between applications, **especially on the web and in API contexts**.

💡 **In short**

* JSON stands for **JavaScript Object Notation**<br>
  It is a **textual** format, <u>human readable and easily interpretable by programs</u>, born from JavaScript but today used in **practically all languages** (Python, Java, C#, PHP, etc.).

📦 **Structure of a JSON file**

A JSON file contains data organized as [**key–value pairs**](https://en.wikipedia.org/wiki/Name%E2%80%93value_pair), for example:

```json
{
  "nome": "Antonio",
  "eta": 45,
  "iscritti": ["Mario", "Lucia", "Giorgio"],
  "attivo": true,
  "dettagli": {
    "ruolo": "Analista",
    "azienda": "ACI"
  }
}
```

👆 This is a JSON object, which contains:

* strings (`"Antonio"`, `"Analista"`)
* numbers (`45`)
* booleans (`true`)
* lists (arrays) (`["Mario", "Lucia", "Giorgio"]`)
* nested objects (`"dettagli": {...}`)
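
When this object is parsed in Python, each JSON type is mapped to a built-in Python type. A minimal check with `json.loads` on the same text:

```python
import json

text = """
{
  "nome": "Antonio",
  "eta": 45,
  "iscritti": ["Mario", "Lucia", "Giorgio"],
  "attivo": true,
  "dettagli": {"ruolo": "Analista", "azienda": "ACI"}
}
"""

data = json.loads(text)              # parse a JSON string into Python objects
for key, value in data.items():
    print(f"{key:10s} -> {type(value).__name__}")
```

The loop prints `str`, `int`, `list`, `bool` and `dict` for the five keys, in that order.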

**In Python**<br>
You can read or write JSON files easily with the `json` module:

```python
    import json

    # Reading
    with open("dati.json", "r") as f:
        dati = json.load(f)
    print(dati["nome"])  # -> Antonio

    # Writing
    nuovi_dati = {"linguaggio": "Python", "versione": 3.12}
    with open("config.json", "w") as f:
        json.dump(nuovi_dati, f, indent=4)
```

**Where is JSON used?**

* REST APIs (almost all modern APIs use JSON to exchange data)
* Configurations (e.g. package.json in Node.js)
* NoSQL databases like MongoDB (which uses BSON, a binary version of JSON)
* Web and mobile applications to pass data between frontend and backend

🆚 **Quick comparison: JSON vs CSV vs XML**

| Format | Type | Pro                 | Con                               |
| ------ | ---- | ------------------- | --------------------------------- |
| JSON   | Text | Readable, universal | Not suitable for binary data      |
| CSV    | Text | Simple for tables   | Does not handle nested structures |
| XML    | Text | Very flexible       | More verbose and heavier          |
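
The CSV limitation in the table (no nested structures) means a nested JSON record has to be flattened before it can become a row. One common convention, sketched here with a hypothetical `flatten` helper, joins nested keys with a dot and list items with a semicolon; pandas also offers `pd.json_normalize` for flattening nested objects.

```python
def flatten(obj, prefix=""):
    """Flatten a nested dict into a single-level dict with dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))               # recurse into nested objects
        elif isinstance(value, list):
            flat[name] = ";".join(str(v) for v in value)    # lists become one text cell
        else:
            flat[name] = value
    return flat

record = {
    "nome": "Antonio",
    "iscritti": ["Mario", "Lucia", "Giorgio"],
    "dettagli": {"ruolo": "Analista", "azienda": "ACI"},
}
flat = flatten(record)
print(flat)
```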



## Conversion of a CSV file or a Python dictionary to JSON and vice versa (i.e. full import/export)

1️⃣ **Python dictionary → JSON (write)**

In [63]:
import json

dati = {
    "nome": "Antonio",
    "eta": 45,
    "linguaggi": ["Python", "SQL", "R"],
    "attivo": True
}

# write to the JSON file (UTF-8, so that non-ASCII characters are stored as-is)
with open("dati.json", "w", encoding="utf-8") as f:
    json.dump(dati, f, indent=4, ensure_ascii=False)

print("✅ JSON file created!")

✅ JSON file created!


**Result** (`dati.json`):

```json
{
    "nome": "Antonio",
    "eta": 45,
    "linguaggi": [
        "Python",
        "SQL",
        "R"
    ],
    "attivo": true
}
```

🔸 `indent=4` → makes the file readable (each list element also goes on its own line)<br>
🔸 `ensure_ascii=False` → keeps accented characters as-is instead of `\uXXXX` escapes


2️⃣ **JSON → Python dictionary (read)**

In [64]:
import json

with open("dati.json", "r") as f:
    dati_letti = json.load(f)

print(dati_letti["nome"])     # Antonio
print(type(dati_letti))       # dict

Antonio
<class 'dict'>


3️⃣ **CSV → JSON**

Let’s use the `Credit_ISLR.csv` file.<br>
Let’s convert it to JSON:


In [65]:
import csv
import json

with open("Credit_ISLR.csv", "r", newline="") as f_csv:
    reader = csv.DictReader(f_csv)
    dati = list(reader)   # list of dicts; every value is read as a string

with open("Credit_ISLR.json", "w") as f_json:
    json.dump(dati, f_json, indent=4, ensure_ascii=False)

print("✅ CSV converted to JSON!")

✅ CSV converted to JSON!
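
One caveat of this route: `csv.DictReader` returns every value as a string, so numbers end up quoted in the JSON file. A possible workaround is to try numeric conversions before dumping. The sketch uses an in-memory CSV with illustrative column names and values, not the actual contents of `Credit_ISLR.csv`, and `convert` is a hypothetical helper:

```python
import csv
import io
import json

# Illustrative data only (assumed columns, not the real file)
csv_text = "Income,Limit,Gender\n14.891,3606,Male\n106.025,6645,Female\n"

def convert(value):
    """Try int, then float; fall back to the original string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value

rows = [
    {key: convert(value) for key, value in row.items()}
    for row in csv.DictReader(io.StringIO(csv_text))
]
print(json.dumps(rows, indent=4))
```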


4️⃣ **JSON → CSV**

Now let’s do the reverse:


In [66]:
import json
import csv

with open("Credit_ISLR.json", "r") as f_json:
    dati = json.load(f_json)

with open("Credit_ISLR_out.csv", "w", newline="") as f_csv:
    writer = csv.DictWriter(f_csv, fieldnames=dati[0].keys())
    writer.writeheader()
    writer.writerows(dati)

print("✅ JSON converted to CSV!")

✅ JSON converted to CSV!


| 💬 In summary | Library       | Key method                         |
| ------------- | ------------- | ---------------------------------- |
| dict → JSON   | `json`        | `json.dump()`                      |
| JSON → dict   | `json`        | `json.load()`                      |
| CSV → JSON    | `csv`, `json` | `csv.DictReader()` + `json.dump()` |
| JSON → CSV    | `json`, `csv` | `json.load()` + `csv.DictWriter()` |


# Technical note on PDFs

In VS Code the rendering of PDF files is different from that of Jupyter Notebook/Lab and from that of Google Colab.<br>
The following `show_pdf` function detects which environment is active and “renders” the PDF accordingly.


In [67]:
def show_pdf(pdf_path, width=1000, height=600):
    """
    Shows a PDF in the most appropriate way for the current environment:
    - In Jupyter: displays inline with IFrame.
    - In Colab: uses IFrame (handles uploaded files well).
    - In VS Code or other environments: opens in the default browser.
    """
    import os, webbrowser, sys
    from pathlib import Path

    pdf_path = Path(pdf_path)
    if not pdf_path.exists():
        raise FileNotFoundError(f"File not found: {pdf_path}")

    # Detect environment
    try:
        shell = get_ipython().__class__.__name__
    except NameError:
        shell = None

    if shell == 'ZMQInteractiveShell':  # Jupyter or Colab
        from IPython.display import IFrame, display
        display(IFrame(str(pdf_path), width=width, height=height))
    elif "vscode" in sys.executable.lower() or "vscode" in os.getcwd().lower():
        # VS Code environment → open in browser
        webbrowser.open(pdf_path.resolve().as_uri())
        print(f"📂 PDF opened in browser: {pdf_path}")
    else:
        # Other environments (terminals, scripts)
        webbrowser.open(pdf_path.resolve().as_uri())
        print(f"📂 PDF opened in browser: {pdf_path}")
