## Introduction to Pandas

Pandas is a library for manipulating and analysing tabular and multidimensional data. It is built on two main data structures, the `Series` and the `DataFrame`. A `Series` is a one-dimensional labelled array whose values share a single type, while a `DataFrame` is a two-dimensional table organised by columns, each of which may have a different type.
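
As a minimal sketch of the two structures, the example below builds a `Series` and a small `DataFrame` by hand. The column names and values are invented for illustration only.

```python
import pandas as pd

# A Series: one-dimensional, with a single dtype for all its values
s = pd.Series([3, 2, 0, 1], name="apples")
print(s.dtype)

# A DataFrame: every column is a Series and may have its own dtype
df = pd.DataFrame({
    "fruit": ["apple", "orange", "pear"],   # strings
    "quantity": [3, 0, 7],                   # integers
    "price": [0.5, 0.8, 1.2],                # floats
})
print(df.dtypes)

# Selecting a single column gives back a Series
quantity = df["quantity"]
print(type(quantity).__name__)
```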

Columns can be inserted and removed and, more generally, a DataFrame makes it easy to perform groupby, merge, slicing, indexing and reshaping operations, and to load and save data in `csv`, `json`, `SQL`, `HTML`, `Excel` and other formats.

DataFrames support mathematical and statistical operations that apply either to every cell or to a subset chosen by the user.
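
The following sketch illustrates these points on a tiny table: inserting and removing a column, an operation applied to every cell, and a statistic computed on a chosen subset of columns. The `total` column is a made-up name used only for this example.

```python
import pandas as pd

purchases = pd.DataFrame(
    {"apples": [3, 2, 0, 1], "oranges": [0, 3, 7, 2]},
    index=["June", "Robert", "Lily", "David"],
)

# Insert a column computed from the existing ones
purchases["total"] = purchases["apples"] + purchases["oranges"]

# An arithmetic operation applied to every cell
doubled = purchases * 2

# A statistic computed on a subset of the columns only
mean_fruit = purchases[["apples", "oranges"]].mean()

# Remove a column; drop returns a new DataFrame and leaves purchases as it was
without_total = purchases.drop(columns="total")

print(doubled)
print(mean_fruit)
print(list(without_total.columns))
```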

In [1]:
import numpy as np
import pandas as pd

# build a simple dataframe from a dict of lists; each list becomes a column (a Series)

data = {
    'apples': [3, 2, 0, 1], 
    'oranges': [0, 3, 7, 2]
}

purchases = pd.DataFrame(data)

purchases


,apples,oranges
0,3,0
1,2,3
2,0,7
3,1,2


In [2]:
# supply a list of index labels for our data

purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

purchases

,apples,oranges
June,3,0
Robert,2,3
Lily,0,7
David,1,2


In [3]:
# select the row labelled 'June'

purchases.loc['June']

apples     3
oranges    0
Name: June, dtype: int64
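
As a short sketch of how label-based and position-based access differ, the example below rebuilds the same `purchases` table and compares `loc` with `iloc`. The variable names are chosen for illustration.

```python
import pandas as pd

purchases = pd.DataFrame(
    {"apples": [3, 2, 0, 1], "oranges": [0, 3, 7, 2]},
    index=["June", "Robert", "Lily", "David"],
)

row_by_label = purchases.loc["Lily"]           # row selected by index label
row_by_position = purchases.iloc[2]            # the same row, by integer position
cell = purchases.loc["Robert", "oranges"]      # row label, then column label

label_slice = purchases.loc["Robert":"Lily"]   # label slices include both ends
position_slice = purchases.iloc[1:2]           # position slices exclude the end

print(row_by_label)
print(cell)
print(list(label_slice.index), list(position_slice.index))
```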

In [4]:
# read from a json file

people = pd.read_json('./Data/persone.json',orient='records')
people

,Cognome e Nome,Data di nascita,Luogo di nascita,Provincia,CF
0,Mario Rossi,03/08/1933,Palermo,PA,XXXTTT33R44R675E
1,Giuseppe Verdi,12/05/1971,Monza,MI,GGGVVV44U66M456P
2,Carlo Bianchi,30/09/2000,Macerata,MC,CCCBBB89R12O984F


In [5]:
# read from a csv file

titanic = pd.read_csv('./Data/titanic.csv',index_col=0)

titanic

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


To read from a MySQL database you need to install a driver, either `mysql-connector-python` or `pymysql`, together with `sqlalchemy`, which manages connections to different database engines and also supports building ORMs.

In [6]:
# read from a database

import pymysql
from sqlalchemy import create_engine

db_connection_str = 'mysql+pymysql://mysqluser:mysqlpassword@localhost:3306/test_db'
mydb = create_engine(db_connection_str)

people = pd.read_sql("SELECT * FROM people", con=mydb, index_col='PersonID')

people

PersonID,name,surname,birthdate,birthplace,taxid
1,Mario,Rossi,1987-03-11,Roma,MRIRSS87C11A001W
2,Giovanni,Bianchi,1961-05-27,Milano,BNCGVN61E27D002Q
3,Francesca,Verdi,2005-11-18,Palermo,VRDFRC05T58G273D
4,Giorgio,Giorgi,1945-04-25,Napoli,GRGGRG45D25F235G


In [7]:
# Save a dataframe to the database
titanic.to_sql('passengers',mydb,if_exists='replace')



In [8]:
# retrieve the women who survived the Titanic
survived_females = pd.read_sql("SELECT PassengerId,Name FROM passengers WHERE Survived=1 AND Sex='female'", con=mydb, index_col='PassengerId')

survived_females

PassengerId,Name
2,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
3,"Heikkinen, Miss. Laina"
4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
9,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)"
10,"Nasser, Mrs. Nicholas (Adele Achem)"
...,...
875,"Abelson, Mrs. Samuel (Hannah Wizosky)"
876,"Najib, Miss. Adele Kiamie ""Jane"""
880,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)"
881,"Shelley, Mrs. William (Imanita Parrish Hall)"
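
The same write-then-query round trip can be tried without a MySQL server. The sketch below swaps in an in-memory SQLite database from the standard library's `sqlite3` module, which pandas accepts directly as a connection, so it is not the MySQL setup used above. The passenger rows are invented for the example.

```python
import sqlite3
import pandas as pd

passengers = pd.DataFrame(
    {
        "Name": ["Ann", "Bob", "Cleo", "Dan"],   # made-up names
        "Sex": ["female", "male", "female", "male"],
        "Survived": [1, 0, 0, 1],
    },
    index=pd.Index([1, 2, 3, 4], name="PassengerId"),
)

con = sqlite3.connect(":memory:")

# The index is written as an ordinary column named after the index
passengers.to_sql("passengers", con, if_exists="replace")

# Values are passed separately from the SQL text instead of being formatted into it
query = "SELECT PassengerId, Name FROM passengers WHERE Survived = ? AND Sex = ?"
survived_females = pd.read_sql(
    query, con, params=(1, "female"), index_col="PassengerId"
)
con.close()

print(survived_females)
```

Passing the filter values through `params` is a design choice rather than a requirement: it avoids building SQL strings by hand when the values come from user input.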


In [9]:
# Quick look at the data and summary information

movies_df = pd.read_csv("./Data/IMDB-Movie-Data.csv", index_col="Title")

movies_df.head(10)



Title,Rank,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
Guardians of the Galaxy,1,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
Split,3,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
Sing,4,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
Suicide Squad,5,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0
The Great Wall,6,"Action,Adventure,Fantasy",European mercenaries searching for black powde...,Yimou Zhang,"Matt Damon, Tian Jing, Willem Dafoe, Andy Lau",2016,103,6.1,56036,45.13,42.0
La La Land,7,"Comedy,Drama,Music",A jazz pianist falls for an aspiring actress i...,Damien Chazelle,"Ryan Gosling, Emma Stone, Rosemarie DeWitt, J....",2016,128,8.3,258682,151.06,93.0
Mindhorn,8,Comedy,A has-been actor best known for playing the ti...,Sean Foley,"Essie Davis, Andrea Riseborough, Julian Barrat...",2016,89,6.4,2490,,71.0
The Lost City of Z,9,"Action,Adventure,Biography","A true-life drama, centering on British explor...",James Gray,"Charlie Hunnam, Robert Pattinson, Sienna Mille...",2016,141,7.1,7188,8.01,78.0
Passengers,10,"Adventure,Drama,Romance",A spacecraft traveling to a distant colony pla...,Morten Tyldum,"Jennifer Lawrence, Chris Pratt, Michael Sheen,...",2016,116,7.0,192177,100.01,41.0


In [10]:
movies_df.tail(2)

Title,Rank,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
Search Party,999,"Adventure,Comedy",A pair of friends embark on a mission to reuni...,Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,4881,,22.0
Nine Lives,1000,"Comedy,Family,Fantasy",A stuffy businessman finds himself trapped ins...,Barry Sonnenfeld,"Kevin Spacey, Jennifer Garner, Robbie Amell,Ch...",2016,87,5.3,12435,19.64,11.0


In [11]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, Guardians of the Galaxy to Nine Lives
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Rank                1000 non-null   int64  
 1   Genre               1000 non-null   object 
 2   Description         1000 non-null   object 
 3   Director            1000 non-null   object 
 4   Actors              1000 non-null   object 
 5   Year                1000 non-null   int64  
 6   Runtime (Minutes)   1000 non-null   int64  
 7   Rating              1000 non-null   float64
 8   Votes               1000 non-null   int64  
 9   Revenue (Millions)  872 non-null    float64
 10  Metascore           936 non-null    float64
dtypes: float64(3), int64(4), object(4)
memory usage: 93.8+ KB


In [12]:
temp_df = pd.concat([movies_df, movies_df])  # concat returns a new DataFrame and leaves movies_df unchanged (DataFrame.append was removed in pandas 2.0)


temp_df.shape

(2000, 11)

In [13]:
temp_df.drop_duplicates(inplace=True)

temp_df.shape

(1000, 11)

In [14]:
temp_df = pd.concat([movies_df, movies_df])  # DataFrame.append was removed in pandas 2.0

temp_df.drop_duplicates(inplace=True,keep=False)

temp_df.shape

(0, 11)
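
The result above is empty because every row of `temp_df` appears exactly twice, and `keep=False` discards all rows that have a duplicate. A small sketch on invented data shows the three settings of `keep` side by side.

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["A", "B", "A", "C"],
    "year": [2014, 2016, 2014, 2012],
})

# Rows 0 and 2 hold the same values; duplicates are detected on column values,
# the index is not taken into account
first = df.drop_duplicates()               # keep="first" is the default
last = df.drop_duplicates(keep="last")     # keep the last occurrence instead
none = df.drop_duplicates(keep=False)      # drop every row that has a duplicate

print(list(first.index))   # [0, 1, 3]
print(list(last.index))    # [1, 2, 3]
print(list(none.index))    # [1, 3]
```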

In [15]:
# access the column names and rename them

movies_df.columns

Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
       'Metascore'],
      dtype='object')

In [16]:
movies_df.rename(columns={
        'Runtime (Minutes)': 'Runtime', 
        'Revenue (Millions)': 'Revenue_millions'
    }, inplace=True)


movies_df.columns

Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year', 'Runtime',
       'Rating', 'Votes', 'Revenue_millions', 'Metascore'],
      dtype='object')

In [17]:
movies_df.columns = [col.lower() for col in movies_df]

movies_df.columns

Index(['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',
       'rating', 'votes', 'revenue_millions', 'metascore'],
      dtype='object')

In [18]:
# Handling null values
# isnull() returns a dataframe of booleans in which each cell
# is True if the value is null and False otherwise

movies_df.isnull()

Title,rank,genre,description,director,actors,year,runtime,rating,votes,revenue_millions,metascore
Guardians of the Galaxy,False,False,False,False,False,False,False,False,False,False,False
Prometheus,False,False,False,False,False,False,False,False,False,False,False
Split,False,False,False,False,False,False,False,False,False,False,False
Sing,False,False,False,False,False,False,False,False,False,False,False
Suicide Squad,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...
Secret in Their Eyes,False,False,False,False,False,False,False,False,False,True,False
Hostel: Part II,False,False,False,False,False,False,False,False,False,False,False
Step Up 2: The Streets,False,False,False,False,False,False,False,False,False,False,False
Search Party,False,False,False,False,False,False,False,False,False,True,False


In [19]:
# summing the booleans column by column gives the number of null values in each column
movies_df.isnull().sum()

rank                  0
genre                 0
description           0
director              0
actors                0
year                  0
runtime               0
rating                0
votes                 0
revenue_millions    128
metascore            64
dtype: int64

In [20]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, Guardians of the Galaxy to Nine Lives
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   rank              1000 non-null   int64  
 1   genre             1000 non-null   object 
 2   description       1000 non-null   object 
 3   director          1000 non-null   object 
 4   actors            1000 non-null   object 
 5   year              1000 non-null   int64  
 6   runtime           1000 non-null   int64  
 7   rating            1000 non-null   float64
 8   votes             1000 non-null   int64  
 9   revenue_millions  872 non-null    float64
 10  metascore         936 non-null    float64
dtypes: float64(3), int64(4), object(4)
memory usage: 93.8+ KB


In [21]:
# Remove the rows that contain null values

movies_df.dropna().info()

<class 'pandas.core.frame.DataFrame'>
Index: 838 entries, Guardians of the Galaxy to Nine Lives
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   rank              838 non-null    int64  
 1   genre             838 non-null    object 
 2   description       838 non-null    object 
 3   director          838 non-null    object 
 4   actors            838 non-null    object 
 5   year              838 non-null    int64  
 6   runtime           838 non-null    int64  
 7   rating            838 non-null    float64
 8   votes             838 non-null    int64  
 9   revenue_millions  838 non-null    float64
 10  metascore         838 non-null    float64
dtypes: float64(3), int64(4), object(4)
memory usage: 78.6+ KB


In [22]:
# Pandas uses the notion of an 'axis' to refer to the dimensions of the table:
# with axis=1 we drop the columns that contain null values

movies_df.dropna(axis=1).info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, Guardians of the Galaxy to Nine Lives
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   rank         1000 non-null   int64  
 1   genre        1000 non-null   object 
 2   description  1000 non-null   object 
 3   director     1000 non-null   object 
 4   actors       1000 non-null   object 
 5   year         1000 non-null   int64  
 6   runtime      1000 non-null   int64  
 7   rating       1000 non-null   float64
 8   votes        1000 non-null   int64  
dtypes: float64(1), int64(4), object(4)
memory usage: 118.1+ KB
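
To make the effect of each call easier to see, here is a small sketch on an invented three-row table. It also shows `subset` and `fillna`, which the notebook does not use at this point, as alternatives to dropping whole rows or columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "rating": [8.1, 7.0, 6.4],
    "revenue": [333.13, np.nan, np.nan],
    "metascore": [76.0, 65.0, np.nan],
})

rows_kept = df.dropna()                          # drop rows with any missing value
cols_kept = df.dropna(axis=1)                    # drop columns with any missing value
subset_kept = df.dropna(subset=["metascore"])    # only check the metascore column

# Replace missing revenues with the column mean instead of dropping anything
filled = df.fillna({"revenue": df["revenue"].mean()})

print(rows_kept.shape)
print(list(cols_kept.columns))
print(list(subset_kept.index))
print(filled)
```

None of these calls modifies `df`; each returns a new DataFrame unless `inplace=True` is passed.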


In [23]:
# The 'axes' are the lists of row and column index labels

movies_df.axes

[Index(['Guardians of the Galaxy', 'Prometheus', 'Split', 'Sing',
        'Suicide Squad', 'The Great Wall', 'La La Land', 'Mindhorn',
        'The Lost City of Z', 'Passengers',
        ...
        'Underworld: Rise of the Lycans', 'Taare Zameen Par',
        'Take Me Home Tonight', 'Resident Evil: Afterlife', 'Project X',
        'Secret in Their Eyes', 'Hostel: Part II', 'Step Up 2: The Streets',
        'Search Party', 'Nine Lives'],
       dtype='object', name='Title', length=1000),
 Index(['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',
        'rating', 'votes', 'revenue_millions', 'metascore'],
       dtype='object')]

In [24]:
# Slicing and subsetting

genre_col = movies_df['genre'] # selecting a single column returns a Series

type(genre_col)



pandas.core.series.Series

In [25]:
genre_col

Title
Guardians of the Galaxy     Action,Adventure,Sci-Fi
Prometheus                 Adventure,Mystery,Sci-Fi
Split                               Horror,Thriller
Sing                        Animation,Comedy,Family
Suicide Squad              Action,Adventure,Fantasy
                                     ...           
Secret in Their Eyes            Crime,Drama,Mystery
Hostel: Part II                              Horror
Step Up 2: The Streets          Drama,Music,Romance
Search Party                       Adventure,Comedy
Nine Lives                    Comedy,Family,Fantasy
Name: genre, Length: 1000, dtype: object

In [26]:
genre_col = movies_df[['genre','rating']] # a list of columns returns a DataFrame subset

type(genre_col)

pandas.core.frame.DataFrame

In [27]:
genre_col.head()

Title,genre,rating
Guardians of the Galaxy,"Action,Adventure,Sci-Fi",8.1
Prometheus,"Adventure,Mystery,Sci-Fi",7.0
Split,"Horror,Thriller",7.3
Sing,"Animation,Comedy,Family",7.2
Suicide Squad,"Action,Adventure,Fantasy",6.2


In [37]:
# Filter by applying an explicit function to the genre column
# to select the films whose genre includes 'Comedy',
# then require a rating at or above a given threshold,
# and return only the subset with the film's year and director

select_high_rank = movies_df[movies_df['genre'].apply(lambda x: (x.find('Comedy') > -1)) & (movies_df['rating'] >= 7.5)][['year','director']]

select_high_rank

Title,year,director
La La Land,2016,Damien Chazelle
Moana,2016,Ron Clements
Deadpool,2016,Tim Miller
Kingsman: The Secret Service,2014,Matthew Vaughn
Zootopia,2016,Byron Howard
Captain Fantastic,2016,Matt Ross
The Wolf of Wall Street,2013,Martin Scorsese
The Big Short,2015,Adam McKay
Superbad,2007,Greg Mottola
Scott Pilgrim vs. the World,2010,Edgar Wright


In [41]:
# Now look for the director with the most highly rated Comedy films

best_director=select_high_rank.groupby('director').count()

best_director.idxmax()

year    Wes Anderson
dtype: object
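
The same filter-and-count pattern can be written with the vectorised string method `str.contains` and with `value_counts`. The sketch below is one possible alternative, not the notebook's own code, and runs on a made-up table of five films with invented director names.

```python
import pandas as pd

films = pd.DataFrame(
    {
        "genre": ["Comedy,Drama", "Action", "Comedy", "Comedy,Family", "Drama"],
        "rating": [8.0, 7.9, 7.6, 6.0, 8.5],
        "director": ["Dir A", "Dir B", "Dir A", "Dir C", "Dir B"],
    },
    index=pd.Index(["F1", "F2", "F3", "F4", "F5"], name="Title"),
)

# Build each boolean mask separately, then combine them with &
is_comedy = films["genre"].str.contains("Comedy", regex=False)
is_high = films["rating"] >= 7.5
selected = films.loc[is_comedy & is_high, ["rating", "director"]]

# Count films per director and take the label with the highest count
counts = selected["director"].value_counts()
best = counts.idxmax()

print(selected)
print(best)
```

When the masks are written inline, each comparison needs its own parentheses because `&` binds more tightly than `>=`.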

In [30]:
# Let's analyse the data so that missing values can be
# imputed more consistently

analysis=movies_df.describe() # describe returns basic summary statistics for the numeric columns

analysis


,rank,year,runtime,rating,votes,revenue_millions,metascore
count,1000.0,1000.0,1000.0,1000.0,1000.0,872.0,936.0
mean,500.5,2012.783,113.172,6.7232,169808.3,82.956376,58.985043
std,288.819436,3.205962,18.810908,0.945429,188762.6,103.25354,17.194757
min,1.0,2006.0,66.0,1.9,61.0,0.0,11.0
25%,250.75,2010.0,100.0,6.2,36309.0,13.27,47.0
50%,500.5,2014.0,111.0,6.8,110799.0,47.985,59.5
75%,750.25,2016.0,123.0,7.4,239909.8,113.715,72.0
max,1000.0,2016.0,191.0,9.0,1791916.0,936.63,100.0


In [31]:
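# note the fairly strong correlation between votes and revenue (about 0.64)
# numeric_only=True restricts the computation to the numeric columns;
# from pandas 2.0 on, the text columns would otherwise raise an error

movies_df.corr(numeric_only=True)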

,rank,year,runtime,rating,votes,revenue_millions,metascore
rank,1.0,-0.261605,-0.221739,-0.219555,-0.283876,-0.271592,-0.191869
year,-0.261605,1.0,-0.1649,-0.211219,-0.411904,-0.12679,-0.079305
runtime,-0.221739,-0.1649,1.0,0.392214,0.407062,0.267953,0.211978
rating,-0.219555,-0.211219,0.392214,1.0,0.511537,0.217654,0.631897
votes,-0.283876,-0.411904,0.407062,0.511537,1.0,0.639661,0.325684
revenue_millions,-0.271592,-0.12679,0.267953,0.217654,0.639661,1.0,0.142397
metascore,-0.191869,-0.079305,0.211978,0.631897,0.325684,0.142397,1.0


In [32]:
# compute the mean revenue of the films that fall in the same
# vote-count quartile as the film whose revenue
# we are estimating

# bin the votes using the intervals defined by the quartiles
# found in the analysis dataframe
# note: pd.cut builds intervals that are open on the left by default, so a film
# whose vote count equals the minimum falls outside the first bin and gets NaN;
# passing include_lowest=True would keep it
buckets= pd.cut(movies_df['votes'],
       [analysis.loc['min','votes'],
        analysis.loc['25%','votes'],
        analysis.loc['50%','votes'],
        analysis.loc['75%','votes'],
        analysis.loc['max','votes']])

# add a new column holding these intervals
movies_df['vote_quartile']=buckets

In [33]:
# group the dataframe by vote interval and compute the mean
# of the revenue column
average_rev=movies_df.groupby(['vote_quartile'])[['revenue_millions']].mean()

# rename the 'revenue_millions' column
average_rev.rename(columns={'revenue_millions':'average_revenue'},inplace=True)
average_rev

vote_quartile,average_revenue
"(61.0, 36309.0]",12.725035
"(36309.0, 110799.0]",38.45186
"(110799.0, 239909.75]",77.102033
"(239909.75, 1791916.0]",172.003185


In [34]:
# inner join the two dataframes on the 'vote_quartile' column,
# so that every film row carries the mean revenue
# of its own interval
# note: merging on a column replaces the 'Title' index with a default integer index

pd.merge(movies_df,average_rev,on='vote_quartile',how='inner')

,rank,genre,description,director,actors,year,runtime,rating,votes,revenue_millions,metascore,vote_quartile,average_revenue
0,1,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0,"(239909.75, 1791916.0]",172.003185
1,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0,"(239909.75, 1791916.0]",172.003185
2,5,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0,"(239909.75, 1791916.0]",172.003185
3,7,"Comedy,Drama,Music",A jazz pianist falls for an aspiring actress i...,Damien Chazelle,"Ryan Gosling, Emma Stone, Rosemarie DeWitt, J....",2016,128,8.3,258682,151.06,93.0,"(239909.75, 1791916.0]",172.003185
4,13,"Action,Adventure,Sci-Fi",The Rebel Alliance makes a risky move to steal...,Gareth Edwards,"Felicity Jones, Diego Luna, Alan Tudyk, Donnie...",2016,133,7.9,323118,532.17,65.0,"(239909.75, 1791916.0]",172.003185
...,...,...,...,...,...,...,...,...,...,...,...,...,...
994,985,"Adventure,Family",A Malinois dog that helped American Marines in...,Boaz Yakin,"Thomas Haden Church, Josh Wiggins, Luke Kleint...",2015,111,6.8,21405,42.65,47.0,"(61.0, 36309.0]",12.725035
995,988,"Drama,Romance",The story of a privileged girl and a charismat...,Shana Feste,"Gabriella Wilde, Alex Pettyfer, Bruce Greenwoo...",2014,104,6.3,33688,23.39,30.0,"(61.0, 36309.0]",12.725035
996,996,"Crime,Drama,Mystery","A tight-knit team of rising investigators, alo...",Billy Ray,"Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...",2015,111,6.2,27585,,45.0,"(61.0, 36309.0]",12.725035
997,999,"Adventure,Comedy",A pair of friends embark on a mission to reuni...,Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,4881,,22.0,"(61.0, 36309.0]",12.725035
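
The merge attaches the per-quartile mean to each film but does not yet fill in the missing revenues. One way to finish the imputation is sketched below on an invented eight-row table. It uses `pd.qcut`, which computes the quantile boundaries itself, and `groupby(...).transform`, which returns one value per original row; both are substitutes for the `pd.cut` plus `merge` route taken above, and the column names are made up for this example.

```python
import numpy as np
import pandas as pd

films = pd.DataFrame({
    "votes":   [100, 200, 300, 400, 5000, 6000, 7000, 8000],
    "revenue": [1.0, np.nan, 3.0, 2.0, 50.0, 70.0, np.nan, 60.0],
})

# Two quantile-based bins (the notebook uses four); the lowest value is included
films["vote_bin"] = pd.qcut(films["votes"], q=2, labels=["low", "high"])

# Mean revenue of each bin, repeated for every row of that bin
bin_mean = films.groupby("vote_bin", observed=True)["revenue"].transform("mean")

# Fill only the missing revenues; known values are left untouched
films["revenue_imputed"] = films["revenue"].fillna(bin_mean)

print(films)
```

Because `transform` keeps the original index, this approach also avoids the index reset that comes with merging on a column.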


In [43]:
# Pivot tables

# Suppose we want the mean survival rate of the Titanic passengers
# by sex and by passenger class

titanic.pivot_table(index='Sex',columns='Pclass',values='Survived',aggfunc='mean',margins=True)

Pclass,1,2,3,All
Sex,,,,
female,0.968085,0.921053,0.5,0.742038
male,0.368852,0.157407,0.135447,0.188908
All,0.62963,0.472826,0.242363,0.383838
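
Since `Survived` holds 0 or 1, its mean within a group is that group's survival rate. As a sketch of what `pivot_table` does here, the example below reproduces the same layout with `groupby` followed by `unstack` on an invented six-row table; `margins` is left out to keep the comparison simple.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex":      ["female", "female", "male", "male", "male", "female"],
    "Pclass":   [1, 3, 1, 3, 3, 1],
    "Survived": [1, 0, 0, 0, 1, 1],
})

pivot = toy.pivot_table(
    index="Sex", columns="Pclass", values="Survived", aggfunc="mean"
)

# Same numbers: group by both keys, average, then move Pclass into the columns
grouped = toy.groupby(["Sex", "Pclass"])["Survived"].mean().unstack("Pclass")

print(pivot)
print(grouped)
```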
