# More on Pandas
by Emiliano Citarella

## Module 5: Working with Files
- More on using `pd.read_csv`
    - HTTP requests
    - Working with files that use delimiters/separators other than commas
    - Setting the index column
- Writing data with `to_csv`
- Reading JSON
- Reading from Excel files
- Writing to Excel files

In [26]:
import pandas as pd

In [27]:
# read_csv can read CSV files hosted on the web.
# Pandas sends the HTTP request for you!
url = "https://gist.githubusercontent.com/ryanorsinger/cc276eea59e8295204d1f581c8da509f/raw/2388559aef7a0700eb31e7604351364b16e99653/mall_customers.csv"
pd.read_csv(url).head()

Unnamed: 0,customer_id,gender,age,annual_income,spending_score
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40


In [28]:
# To set the index column, use the index_col argument.
# If a column makes sense as the index, you have to specify it yourself.
pd.read_csv(url, index_col="customer_id").head()

customer_id,gender,age,annual_income,spending_score
1,Male,19,15,39
2,Male,21,15,81
3,Female,20,16,6
4,Female,23,16,77
5,Female,31,17,40
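
The effect of `index_col` can also be tried offline. The following is a minimal sketch, assuming a small inline CSV whose three rows are copied from the output above; `csv_text`, `default_index` and `by_customer` are names introduced here for illustration. `pd.read_csv` accepts any file-like object, so `io.StringIO` stands in for the hosted file.

```python
import io

import pandas as pd

# Inline stand-in for the hosted CSV (illustrative sample, not the full file)
csv_text = """customer_id,gender,age,annual_income,spending_score
1,Male,19,15,39
2,Male,21,15,81
3,Female,20,16,6
"""

# Without index_col, pandas generates a 0-based index
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col, the named column becomes the row labels
by_customer = pd.read_csv(io.StringIO(csv_text), index_col="customer_id")

# Rows can now be looked up by customer_id
print(by_customer.loc[2, "age"])  # 21
```

Once a column is the index it no longer appears among the regular columns, which is why `customer_id` sits apart from the others in the output above.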


In [29]:
# The ! operator in Jupyter notebooks or IPython sends a command to the terminal.
# On Windows without the Windows Subsystem for Linux enabled, use !dir *.csv
!ls *.csv

2020_sales.csv      all_sales.csv       mpg.csv             tips.csv
2021_sales.csv      all_sales_clean.csv penguins.csv
2022_sales.csv      maintenance.csv     quotes.csv


In [30]:
!ls *sales*.csv
# The !ls command lists the files in the current directory.
# The pattern *sales*.csv keeps only the CSV files whose names contain "sales".

2020_sales.csv      2022_sales.csv      all_sales_clean.csv
2021_sales.csv      all_sales.csv


In [31]:
sales_files = !ls *sales*.csv
sales_files
# Assigning the files to a variable:
# !ls lists the files and the result is assigned to the variable sales_files.
# sales_files is now a list of strings, where each string is the name of a CSV file
# found in the directory.

['2020_sales.csv',
 '2021_sales.csv',
 '2022_sales.csv',
 'all_sales.csv',
 'all_sales_clean.csv']

In [32]:
# Reading multiple files programmatically
sales_data = []
# sales_data is an empty list that will collect the DataFrame loaded from each CSV file.
for file in sales_files:
    df = pd.read_csv(file)
    sales_data.append(df)

# The for loop iterates over every file name in sales_files.
# For each file, pd.read_csv(file) reads the CSV into a pandas DataFrame,
# and sales_data.append(df) adds that DataFrame to the sales_data list.
# After the loop, sales_data is a list of DataFrames, one per CSV file.

sales_df = pd.concat(sales_data, ignore_index=True)
sales_df

# pd.concat() combines all the DataFrames in sales_data into a single DataFrame, sales_df.
# ignore_index=True assigns a new incremental index to the result,
# discarding the original indexes of the source DataFrames.

Unnamed: 0.1,year,items,units,Unnamed: 0
0,2020,trucks,20,
1,2020,sedans,15,
2,2020,compact vehicles,14,
3,2021,trucks,35,
4,2021,sedans,30,
5,2021,compact vehicles,17,
6,2022,trucks,40,
7,2022,sedans,31,
8,2022,compact vehicles,35,
9,2020,trucks,20,0.0
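
The `!ls` syntax only works inside IPython or Jupyter. Below is a portable sketch of the same read-and-concatenate loop, using the standard-library `glob` module instead of a shell command. The temporary directory and the two small files written into it are assumptions made so that the example runs anywhere; they are not the course files.

```python
import glob
import os
import tempfile

import pandas as pd

# Create two small yearly files in a throwaway directory (illustrative data)
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"year": [2020], "items": ["trucks"], "units": [20]}).to_csv(
    os.path.join(tmp_dir, "2020_sales.csv"), index=False
)
pd.DataFrame({"year": [2021], "items": ["trucks"], "units": [35]}).to_csv(
    os.path.join(tmp_dir, "2021_sales.csv"), index=False
)

# glob does not guarantee any order, so sort the matching paths
sales_files = sorted(glob.glob(os.path.join(tmp_dir, "*_sales.csv")))

# Read every file, then stack the frames with a fresh 0..n-1 index
sales_df = pd.concat(
    [pd.read_csv(path) for path in sales_files], ignore_index=True
)
print(sales_df)
```

A broad pattern also picks up any combined file that an earlier run wrote into the same directory. That appears to be why row 9 in the output above repeats row 0 and carries a stray `Unnamed: 0` value.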


In [33]:
sales_files = !ls *sales.csv
sales_files

['2020_sales.csv', '2021_sales.csv', '2022_sales.csv', 'all_sales.csv']

In [34]:
sales_data = []
for file in sales_files:
    df = pd.read_csv(file)
    sales_data.append(df)
    
sales_df = pd.concat(sales_data, ignore_index=True)
sales_df

Unnamed: 0.1,year,items,units,Unnamed: 0
0,2020,trucks,20,
1,2020,sedans,15,
2,2020,compact vehicles,14,
3,2021,trucks,35,
4,2021,sedans,30,
5,2021,compact vehicles,17,
6,2022,trucks,40,
7,2022,sedans,31,
8,2022,compact vehicles,35,
9,2020,trucks,20,0.0


In [35]:
# In practice, it is common to combine many different data sources into a single dataframe for cleaning/analysis
# By default, to_csv writes the index values to the file as a column of their own
sales_df.to_csv("all_sales.csv")

In [36]:
!ls *.csv

2020_sales.csv      all_sales.csv       mpg.csv             tips.csv
2021_sales.csv      all_sales_clean.csv penguins.csv
2022_sales.csv      maintenance.csv     quotes.csv


In [37]:
# Note how the index written by to_csv comes back as an extra unnamed column
pd.read_csv("all_sales.csv").head()

Unnamed: 0.2,Unnamed: 0.1,year,items,units,Unnamed: 0
0,0,2020,trucks,20,
1,1,2020,sedans,15,
2,2,2020,compact vehicles,14,
3,3,2021,trucks,35,
4,4,2021,sedans,30,


In [38]:
# Let's look at an example where we avoid this complication by paying closer attention to the index
# The index argument of to_csv takes a boolean and defaults to True
sales_df.to_csv("all_sales_clean.csv", index=False)

In [48]:
# Note that the index is regenerated on read and no new index column is added
pd.read_csv("all_sales_clean.csv")

Unnamed: 0.1,year,items,units,Unnamed: 0
0,2020,trucks,20,
1,2020,sedans,15,
2,2020,compact vehicles,14,
3,2021,trucks,35,
4,2021,sedans,30,
5,2021,compact vehicles,17,
6,2022,trucks,40,
7,2022,sedans,31,
8,2022,compact vehicles,35,
9,2020,trucks,20,0.0


If you use a named index column instead of only the automatically generated index, you avoid this problem: the index is written under its own name rather than coming back as an `Unnamed` column.
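
The three cases can be compared in a short round trip. This is a minimal sketch, assuming a two-row dataframe invented for illustration; calling `to_csv()` without a path returns the CSV text, and `io.StringIO` lets `read_csv` read it back without touching the disk.

```python
import io

import pandas as pd

df = pd.DataFrame({"items": ["trucks", "sedans"], "units": [20, 15]})

# 1) Default: the unnamed index is written as a first column with an empty header
with_index = pd.read_csv(io.StringIO(df.to_csv()))
print(list(with_index.columns))  # ['Unnamed: 0', 'items', 'units']

# 2) index=False: the index is not written at all
no_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))
print(list(no_index.columns))  # ['items', 'units']

# 3) Named index: written under its own name, restored with index_col
named = df.set_index("items")
restored = pd.read_csv(io.StringIO(named.to_csv()), index_col="items")
print(restored.index.name)  # items
```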

### A note on separator characters, called delimiters
- CSV files use commas to separate values
- You may encounter files that use a delimiter other than a comma
- Tab-separated files are common in log files and spreadsheet exports
- Sometimes you will see the `.tsv` file extension, for tab-separated values
- You may also encounter delimiters other than commas or tabs in plain-text files
- Use `pd.read_csv` for all of them (unless the file is JSON) and pass the appropriate character with the `sep` argument
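
As a small illustration of the points above, the sketch below parses the same one-row table twice, once tab-separated and once pipe-separated. The inline strings are made-up samples, and the pipe delimiter is only an assumed example of "something other than a comma or a tab".

```python
import io

import pandas as pd

tab_text = "species\tisland\tyear\nAdelie\tTorgersen\t2007\n"
pipe_text = "species|island|year\nAdelie|Torgersen|2007\n"

# sep tells read_csv which character separates the values
tabs = pd.read_csv(io.StringIO(tab_text), sep="\t")
pipes = pd.read_csv(io.StringIO(pipe_text), sep="|")

# Both parse to the same table
print(tabs.equals(pipes))

# With the default comma separator, each line lands in a single column
wrong = pd.read_csv(io.StringIO(pipe_text))
print(wrong.shape)
```

A dataframe with a single wide column is usually the first sign that the wrong separator was used.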

In [49]:
pd.read_csv("penguins_with_tabs.tsv", sep="\t").head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,Adelie,Torgersen,,,,,,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007


In [51]:
# read_json also accepts a URL; here it returns a JSON array of comment objects
curie_quotes = pd.read_json("https://jsonplaceholder.typicode.com/comments")
curie_quotes

Unnamed: 0,postId,id,name,email,body
0,1,1,id labore ex et quam laborum,Eliseo@gardner.biz,laudantium enim quasi est quidem magnam volupt...
1,1,2,quo vero reiciendis velit similique earum,Jayne_Kuhic@sydney.com,est natus enim nihil est dolore omnis voluptat...
2,1,3,odio adipisci rerum aut animi,Nikita@garfield.biz,quia molestiae reprehenderit quasi aspernatur\...
3,1,4,alias odio sit,Lew@alysha.tv,non et atque\noccaecati deserunt quas accusant...
4,1,5,vero eaque aliquid doloribus et culpa,Hayden@althea.biz,harum non quasi et ratione\ntempore iure ex vo...
...,...,...,...,...,...
495,100,496,et occaecati asperiores quas voluptas ipsam no...,Zola@lizzie.com,neque unde voluptatem iure\nodio excepturi ips...
496,100,497,doloribus dolores ut dolores occaecati,Dolly@mandy.co.uk,non dolor consequatur\nlaboriosam ut deserunt ...
497,100,498,dolores minus aut libero,Davion@eldora.net,aliquam pariatur suscipit fugiat eos sunt\nopt...
498,100,499,excepturi sunt cum a et rerum quo voluptatibus...,Wilburn_Labadie@araceli.name,et necessitatibus tempora ipsum quaerat invent...
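
To see how the JSON maps onto the table, here is a minimal offline sketch. It assumes a JSON array of objects shaped like the response above; the two records and their example.com addresses are invented for illustration. Each object becomes a row and each key becomes a column.

```python
import io

import pandas as pd

# A JSON array of objects (illustrative records, not the real API response)
json_text = """[
  {"postId": 1, "id": 1, "email": "a@example.com"},
  {"postId": 1, "id": 2, "email": "b@example.com"}
]"""

# Wrapping the text in StringIO gives read_json a file-like object
comments = pd.read_json(io.StringIO(json_text))
print(comments)
```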


## Example of using `read_clipboard`

| manufacturer |             model | displ | year |  cyl |      trans |  drv |  cty |  hwy |   fl | class   |
| -----------: | ----------------: | ----: | ---: | ---: | ---------: | ---: | ---: | ---: | ---: | ------- |
|         audi |                a4 |   2.0 | 2008 |    4 |   auto(av) |    f |   21 |   30 |    p | compact |
|        dodge | dakota pickup 4wd |   3.9 | 1999 |    6 | manual(m5) |    4 |   14 |   17 |    r | pickup  |
|       toyota |       4runner 4wd |   4.7 | 2008 |    8 |   auto(l5) |    4 |   14 |   17 |    r | suv     |
|        dodge |       caravan 2wd |   3.8 | 2008 |    6 |   auto(l6) |    f |   16 |   23 |    r | minivan |
|    chevrolet |            malibu |   3.6 | 2008 |    6 |   auto(s6) |    f |   17 |   26 |    r | midsize |


In [54]:
# Copy the table above to the clipboard first, then run this cell
df = pd.read_clipboard()
df

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
0,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
1,dodge,dakota pickup 4wd,3.9,1999,6,manual(m5),4,14,17,r,pickup
2,toyota,4runner 4wd,4.7,2008,8,auto(l5),4,14,17,r,suv
3,dodge,caravan 2wd,3.8,2008,6,auto(l6),f,16,23,r,minivan
4,chevrolet,malibu,3.6,2008,6,auto(s6),f,17,26,r,midsize


In [55]:
# Writing a dataframe in memory to an excel file
df.to_excel("mpg.xlsx", index=False)

In [56]:
# Reading an excel file (simple version)
mpg = pd.read_excel("mpg.xlsx")

In [23]:
mpg


In [57]:
# Reading a specific sheet from an Excel file
pd.read_excel("example_spreadsheet.xlsx", sheet_name="grocery_list")

Unnamed: 0,item,price,quantity
0,cat foot,3.99,2
1,toilet paper,7.99,1
2,beans,0.99,2
3,corn,0.75,4


In [58]:
# Note the extra rows of text above the actual table
pd.read_excel("example_spreadsheet.xlsx", sheet_name="pet_info")

Unnamed: 0,Fancy Company Reports,Unnamed: 1,Unnamed: 2
0,Quality service for all your pet needs,,
1,Company Motto: when your companion is in need,,
2,,,
3,Pet Name,Species,weight
4,Fluffy,cat,8
5,Max,dog,18
6,Gus,iguana,12


In [59]:
# Sometimes you need to open the spreadsheet to work out how many rows to skip
pd.read_excel("example_spreadsheet.xlsx", sheet_name="pet_info", skiprows=4)

Unnamed: 0,Pet Name,Species,weight
0,Fluffy,cat,8
1,Max,dog,18
2,Gus,iguana,12
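
`pd.read_csv` takes the same `skiprows` argument, so the idea can be tried without a spreadsheet. The sketch below assumes a CSV rendering of the `pet_info` sheet shown above; `report_text` is a name introduced here, and the layout of four preamble lines is taken from that output.

```python
import io

import pandas as pd

# Four lines of report preamble, then the real header and data
report_text = """Fancy Company Reports,,
Quality service for all your pet needs,,
Company Motto: when your companion is in need,,
,,
Pet Name,Species,weight
Fluffy,cat,8
Max,dog,18
Gus,iguana,12
"""

# skiprows=4 drops the first four lines; the fifth line becomes the header
pets = pd.read_csv(io.StringIO(report_text), skiprows=4)
print(pets)
```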


## Additional Resources
- https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
- https://pandas.pydata.org/docs/reference/api/pandas.read_clipboard.html
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_clipboard.html
- https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html
- Other formats https://pandas.pydata.org/docs/user_guide/io.html
    - SQL
    - XML
    - STATA
    - SAS
    - SPSS

## Exercises
- Use `pd.read_json` to read the Dolly Parton quotes into a dataframe named `dolly`. Dolly Parton quotes:
https://gist.githubusercontent.com/ryanorsinger/ad042d8ee4340ae7026e215bc6b69665/raw/4c0eef2e4cbce5e47b674e8d1d5bad34f0c7b757/dolly.json



- Read the Bob Ross quotes into a dataframe named `bob`. Bob Ross quotes in JSON: https://gist.githubusercontent.com/ryanorsinger/ad042d8ee4340ae7026e215bc6b69665/raw/b0c1c816d87e4d3db34e52d35e376394f689911e/bob_ross.json


- Make a dictionary using the keys "quote" and "author" and provide a quote of your choice. Be sure to wrap the dictionary in square brackets. Use `pd.DataFrame` to turn this list containing the single dictionary into a one row dataframe. Name your new dataframe `my_quote`.


- Next, use `pd.concat` to combine all three dataframes together in a new variable named `quotes`. 


- Use `to_csv` to write the `quotes` dataframe to disk, providing the file name `quotes.csv`.


- Read this drinks JSON into a dataframe called `drinks`
https://gist.githubusercontent.com/ryanorsinger/ad042d8ee4340ae7026e215bc6b69665/raw/b0c1c816d87e4d3db34e52d35e376394f689911e/drinks.json


- Now, read in the beverage cost CSV into a dataframe called `drink_costs`
https://gist.githubusercontent.com/ryanorsinger/ad042d8ee4340ae7026e215bc6b69665/raw/b0c1c816d87e4d3db34e52d35e376394f689911e/drink_cost.csv


- Combine these dataframes, and overwrite the dataframe called `drinks` using `pd.concat` 


- Finally, write your `drinks` dataframe to disk using `.to_excel`. Name the file `drinks.xlsx`.

In [60]:
dolly = pd.read_json("https://gist.githubusercontent.com/ryanorsinger/ad042d8ee4340ae7026e215bc6b69665/raw/4c0eef2e4cbce5e47b674e8d1d5bad34f0c7b757/dolly.json")
dolly

Unnamed: 0,quote,author
0,"We cannot direct the wind, but we can adjust t...",Dolly Parton
1,Find out who you are and do it on purpose,Dolly Parton
2,"If you don't like the road you're walking, sta...",Dolly Parton
3,You'll never do a whole lot unless you're brav...,Dolly Parton
4,I'm not going to limit myself just because peo...,Dolly Parton
5,I think everybody has the right to be who they...,Dolly Parton


In [62]:
bob = pd.read_json("https://gist.githubusercontent.com/ryanorsinger/ad042d8ee4340ae7026e215bc6b69665/raw/b0c1c816d87e4d3db34e52d35e376394f689911e/bob_ross.json")
bob

Unnamed: 0,quote,author
0,"We don't make mistakes, just happy little acci...",Bob Ross
1,"Talent is a pursued interest. In other words, ...",Bob Ross
2,"Anything that you try and you don't succeed, i...",Bob Ross


In [63]:
my_quote = pd.DataFrame([
    {
        # Adjacent string literals are joined, so no stray whitespace ends up in the quote
        "quote": "The bridge is not supported by one stone or another, "
                 "but by the line of the arch that they form",
        "author": "Italo Calvino"
    }
])

my_quote

Unnamed: 0,quote,author
0,The bridge is not supported by one stone or an...,Italo Calvino


In [64]:
quotes = pd.concat([dolly, bob, my_quote])
quotes

Unnamed: 0,quote,author
0,"We cannot direct the wind, but we can adjust t...",Dolly Parton
1,Find out who you are and do it on purpose,Dolly Parton
2,"If you don't like the road you're walking, sta...",Dolly Parton
3,You'll never do a whole lot unless you're brav...,Dolly Parton
4,I'm not going to limit myself just because peo...,Dolly Parton
5,I think everybody has the right to be who they...,Dolly Parton
0,"We don't make mistakes, just happy little acci...",Bob Ross
1,"Talent is a pursued interest. In other words, ...",Bob Ross
2,"Anything that you try and you don't succeed, i...",Bob Ross
0,The bridge is not supported by one stone or an...,Italo Calvino
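
The combined dataframe above keeps each source's own row labels, so 0, 1 and 2 appear more than once. If unique labels are preferred, `ignore_index=True` renumbers the rows. This is a minimal sketch with placeholder quotes and authors, invented for illustration.

```python
import pandas as pd

first = pd.DataFrame({"quote": ["a", "b"], "author": ["X", "X"]})
second = pd.DataFrame([{"quote": "c", "author": "Y"}])

# Default: original labels are kept, so 0 appears twice
stacked = pd.concat([first, second])
print(stacked.index.tolist())  # [0, 1, 0]

# ignore_index=True: rows are renumbered 0..n-1
renumbered = pd.concat([first, second], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```

Since the next step writes the file with `index=False`, the duplicate labels do not reach `quotes.csv` either way.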


In [65]:
quotes.to_csv("quotes.csv", index=False)