<h1>PANDAS DATA FRAMES</h1>

<p>A DataFrame is a table, much like an Excel spreadsheet. DataFrames are commonly created from CSV files.</p>

In [105]:
import numpy as np
import pandas as pd
pd.options.display.float_format = '{:,.2f}'.format

<p>Creating <b>DataFrames</b> manually can be tedious. Most of the time you'll be pulling the data from a database, a CSV file, or the web. You can still create a DataFrame by specifying the columns and their values:</p>

In [3]:
df = pd.DataFrame({
    'Population': [35.467, 63.951, 80.94, 60.665, 127.061, 64.511, 318.523],
    'GDP': [1785387,2833687,3874437,2167744,4602367,2950039,17348075],
    'Surface Area': [9984670,640679,357114,301336,377930,242495,9525067],
    'HDI': [.913,.888,.916,.873,.891,.907,.915],
    'Continent': ['America', 'Europe', 'Europe', 'Europe', 'Asia', 'Europe', 'America']
})

<b>NOTE:</b> You create a DataFrame much as you would a dictionary.<br>
In this case:<br>
<ol>
<li>The <b>KEYS</b> serve as the COLUMN headers.</li>
<li>The <b>VALUES</b> are lists holding each column's values.</li>
</ol>

Ex. 
df = pd.DataFrame( { <b>'KEY 1':</b> [Value 1, Value 2, Value n...], <b>'KEY 2':</b> [Value 1, Value 2, Value n...] } )
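
As a minimal sketch of the key/value idea, the snippet below builds a small frame from a dictionary and then reads the same table from CSV text. The two-row sample and the in-memory `csv_text` string are assumptions made for illustration; in practice `pd.read_csv` would usually be given a file path.

```python
import io

import pandas as pd

# Dictionary keys become column names; each list holds that column's values.
small = pd.DataFrame({
    'Population': [35.467, 63.951],
    'GDP': [1785387, 2833687],
})

# The same table as CSV text (an illustrative stand-in for a real file).
csv_text = "Population,GDP\n35.467,1785387\n63.951,2833687\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

print(list(small.columns))
print(list(from_csv.columns))
```

Both frames end up with the same column names, and both get the default integer index because no index was supplied.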

In [99]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
0,35.467,1785387,9984670,0.913,America
1,63.951,2833687,640679,0.888,Europe
2,80.94,3874437,357114,0.916,Europe
3,60.665,2167744,301336,0.873,Europe
4,127.061,4602367,377930,0.891,Asia
5,64.511,2950039,242495,0.907,Europe
6,318.523,17348075,9525067,0.915,America


<b>Assigning an index with customized names to the DataFrame</b>

In [6]:
df.index = [
    'Canada', 
    'France',
    'Germany', 
    'Italy', 
    'Japan',
    'United Kingdom',
    'United States'
]

In [7]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387,9984670,0.913,America
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


In [8]:
df.columns

Index(['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'], dtype='object')

In [9]:
df.index

Index(['Canada', 'France', 'Germany', 'Italy', 'Japan', 'United Kingdom',
       'United States'],
      dtype='object')

In [10]:
df.info() #Tells you columns, dtypes, and non-null values

<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, Canada to United States
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Population    7 non-null      float64
 1   GDP           7 non-null      int64  
 2   Surface Area  7 non-null      int64  
 3   HDI           7 non-null      float64
 4   Continent     7 non-null      object 
dtypes: float64(2), int64(2), object(1)
memory usage: 336.0+ bytes


In [11]:
df.size # Total Elements

35

In [12]:
df.shape # 7 rows/5 cols

(7, 5)

In [18]:
df.describe() # Summary statistics; often run first when cleaning data to get a feel for it

Unnamed: 0,Population,GDP,Surface Area,HDI
count,7.0,7.0,7.0,7.0
mean,107.302571,5080248.0,3061327.0,0.900429
std,97.24997,5494020.0,4576187.0,0.016592
min,35.467,1785387.0,242495.0,0.873
25%,62.308,2500716.0,329225.0,0.8895
50%,64.511,2950039.0,377930.0,0.907
75%,104.0005,4238402.0,5082873.0,0.914
max,318.523,17348080.0,9984670.0,0.916


In [14]:
df.dtypes

Population      float64
GDP               int64
Surface Area      int64
HDI             float64
Continent        object
dtype: object

In [17]:
df.dtypes.value_counts()

float64    2
int64      2
object     1
dtype: int64

<h3>Indexing, Selection, and Slicing</h3>
Individual columns in the DataFrame can be selected with regular indexing. Each column is represented as a <b>Series</b>

In [20]:
df['Population']

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: Population, dtype: float64

<b>NOTE:</b> The index of the returned SERIES is the same as the DataFrame one.<br>Its NAME is the name of the column.<br>If you're working in a notebook and want to see a more DataFrame-like format, you can use the <b>to_frame</b> method. 

In [21]:
df['Population'].to_frame()

Unnamed: 0,Population
Canada,35.467
France,63.951
Germany,80.94
Italy,60.665
Japan,127.061
United Kingdom,64.511
United States,318.523
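
A short sketch of the difference between the two return types follows. The two-row frame is a reduced copy of the data above, used only to keep the example small.

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.467, 63.951], 'GDP': [1785387, 2833687]},
    index=['Canada', 'France'],
)

as_series = df['Population']    # single label -> Series
as_frame = df[['Population']]   # list holding one label -> DataFrame

print(type(as_series).__name__, type(as_frame).__name__)
print(as_frame.equals(df['Population'].to_frame()))
```

Selecting with a one-item list is therefore another way to get the DataFrame-like view that `to_frame` produces.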


Multiple columns can also be selected similarly to numpy and SERIES:

In [22]:
df[['Population', 'GDP']]

Unnamed: 0,Population,GDP
Canada,35.467,1785387
France,63.951,2833687
Germany,80.94,3874437
Italy,60.665,2167744
Japan,127.061,4602367
United Kingdom,64.511,2950039
United States,318.523,17348075


<p>In this case, the result is another DataFrame. Slicing works differently: it acts at <u>row level</u>, which can be counterintuitive:</p>

In [23]:
df[1:3]

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe


Row-level selection works better with loc and iloc, <u>which are recommended</u> over regular 'direct slicing' (df[:]).<br>
*<b>loc</b> selects rows by index label<br>
*<b>iloc</b> selects rows by integer position

In [28]:
# loc selects rows matching the given index
df.loc['Italy']

Population       60.665
GDP             2167744
Surface Area     301336
HDI               0.873
Continent        Europe
Name: Italy, dtype: object

In [39]:
df.loc['France': 'Italy'] # Slicing rows by label with loc; the end label is included

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe


As a second 'argument' you can pass the column(s) you'd like to select:

In [30]:
df.loc['France':'Italy', 'Population']

France     63.951
Germany    80.940
Italy      60.665
Name: Population, dtype: float64

In [29]:
df.loc['France': 'Italy', ['Population', 'GDP']]

Unnamed: 0,Population,GDP
France,63.951,2833687
Germany,80.94,3874437
Italy,60.665,2167744


<h4>Indexing and Slicing by Position (iloc)</h4>
<b>REMEMBER:</b> Indexing notation is <b>[row, col]</b>, and slice ranges are written with a colon (<b>start:stop</b>). With iloc the stop position is excluded.

In [31]:
df.iloc[0]

Population       35.467
GDP             1785387
Surface Area    9984670
HDI               0.913
Continent       America
Name: Canada, dtype: object

In [34]:
df.iloc[-1]

Population       318.523
GDP             17348075
Surface Area     9525067
HDI                0.915
Continent        America
Name: United States, dtype: object

In [36]:
df.iloc[[0,1,-1]] # Pass a list of positions to select several rows; the inner brackets are that list

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387,9984670,0.913,America
France,63.951,2833687,640679,0.888,Europe
United States,318.523,17348075,9525067,0.915,America


In [38]:
df.iloc[1:3]

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe


In [40]:
df.iloc[1:3, 3] 

France     0.888
Germany    0.916
Name: HDI, dtype: float64

In [41]:
df.iloc[1:3, [0,3]]

Unnamed: 0,Population,HDI
France,63.951,0.888
Germany,80.94,0.916


In [42]:
df.iloc[1:3, 1:3]

Unnamed: 0,GDP,Surface Area
France,2833687,640679
Germany,3874437,357114
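
The sketch below contrasts the two slicing rules on a reduced, four-row copy of the data: a label slice with loc includes both endpoints, while a positional slice with iloc excludes the stop position.

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.467, 63.951, 80.94, 60.665]},
    index=['Canada', 'France', 'Germany', 'Italy'],
)

by_label = df.loc['France':'Germany']   # label slice: both endpoints included
by_position = df.iloc[1:3]              # positional slice: position 3 excluded

print(list(by_label.index))
print(list(by_position.index))
```

Both selections return France and Germany, even though the loc slice names its last row and the iloc slice names the position after it.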


<h3>Conditional Selection (boolean arrays)</h3>
<p>The conditional selection used for SERIES works the same way for DataFrames.</p>

In [43]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387,9984670,0.913,America
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


In [46]:
# Which countries have a population greater than 70 (the column is in millions)?
df['Population'] > 70 

Canada            False
France            False
Germany            True
Italy             False
Japan              True
United Kingdom    False
United States      True
Name: Population, dtype: bool

In [49]:
# Show only the countries with a population greater than 70, with all of their columns.
df.loc[df['Population'] > 70]

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Germany,80.94,3874437,357114,0.916,Europe
Japan,127.061,4602367,377930,0.891,Asia
United States,318.523,17348075,9525067,0.915,America


The boolean matching is done at index level, so you can filter by any boolean Series as long as it has the right index labels. Column selection works as expected.

In [50]:
# Show ONLY the population of countries with a population greater than 70.
df.loc[df['Population'] > 70, 'Population']

Germany           80.940
Japan            127.061
United States    318.523
Name: Population, dtype: float64

In [52]:
# Show ONLY the population and GDP of countries with a population greater than 70.
df.loc[df['Population'] > 70, ['Population', 'GDP']]

Unnamed: 0,Population,GDP
Germany,80.94,3874437
Japan,127.061,4602367
United States,318.523,17348075
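
Conditions can also be combined. The following is a minimal sketch on a reduced copy of the data; the thresholds are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Population': [35.467, 63.951, 80.94, 127.061, 318.523],
        'HDI': [.913, .888, .916, .891, .915],
    },
    index=['Canada', 'France', 'Germany', 'Japan', 'United States'],
)

# Each comparison yields a boolean Series. Wrap each one in parentheses
# and combine them with & (and), | (or), or ~ (not).
large_and_high_hdi = df.loc[(df['Population'] > 70) & (df['HDI'] > 0.9)]
small_or_low_hdi = df.loc[(df['Population'] < 40) | (df['HDI'] < 0.89)]

print(list(large_and_high_hdi.index))
print(list(small_or_low_hdi.index))
```

The parentheses matter because `&` and `|` bind more tightly than the comparison operators in Python.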


<h3>Dropping Stuff</h3>
<p>Dropping is the opposite of selecting: instead of pointing out which values you'd like to keep, you point out which ones to drop.</p>

In [61]:
# Remove Canada from DataFrame view.
df.drop('Canada')

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


In [62]:
# Remove Canada and Japan from DataFrame view.
df.drop(['Canada','Japan'])

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


In [63]:
# Remove Population and HDI from DataFrame view.
df.drop(columns = ['Population', 'HDI'])

Unnamed: 0,GDP,Surface Area,Continent
Canada,1785387,9984670,America
France,2833687,640679,Europe
Germany,3874437,357114,Europe
Italy,2167744,301336,Europe
Japan,4602367,377930,Asia
United Kingdom,2950039,242495,Europe
United States,17348075,9525067,America


In [57]:
#Remove Italy and Canada
df.drop(['Italy', 'Canada'], axis=0)

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Japan,127.061,4602367,377930,0.891,Asia
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


In [58]:
df.drop(['Population', 'HDI'], axis=1)

Unnamed: 0,GDP,Surface Area,Continent
Canada,1785387,9984670,America
France,2833687,640679,Europe
Germany,3874437,357114,Europe
Italy,2167744,301336,Europe
Japan,4602367,377930,Asia
United Kingdom,2950039,242495,Europe
United States,17348075,9525067,America


In [59]:
df.drop(['Population', 'HDI'], axis='columns')

Unnamed: 0,GDP,Surface Area,Continent
Canada,1785387,9984670,America
France,2833687,640679,Europe
Germany,3874437,357114,Europe
Italy,2167744,301336,Europe
Japan,4602367,377930,Asia
United Kingdom,2950039,242495,Europe
United States,17348075,9525067,America


In [60]:
df.drop(['Canada', 'Germany'], axis='rows')

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
France,63.951,2833687,640679,0.888,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America
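
As a quick check that drop does not change the original, this sketch compares shapes before and after on a reduced copy of the data. It uses the `index=` and `columns=` keywords, which are equivalent to the `axis` forms shown above.

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.467, 63.951, 80.94], 'HDI': [.913, .888, .916]},
    index=['Canada', 'France', 'Germany'],
)

without_canada = df.drop(index='Canada')   # same as df.drop('Canada', axis=0)
without_hdi = df.drop(columns='HDI')       # same as df.drop('HDI', axis=1)

# df itself still has every row and column.
print(df.shape, without_canada.shape, without_hdi.shape)
```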


<h3> Operations </h3> 

In [74]:
df[['Population', 'GDP']]

Unnamed: 0,Population,GDP
Canada,35.467,1785387
France,63.951,2833687
Germany,80.94,3874437
Italy,60.665,2167744
Japan,127.061,4602367
United Kingdom,64.511,2950039
United States,318.523,17348075


In [75]:
df[['Population', 'GDP']] / 100

Unnamed: 0,Population,GDP
Canada,0.35467,17853.87
France,0.63951,28336.87
Germany,0.8094,38744.37
Italy,0.60665,21677.44
Japan,1.27061,46023.67
United Kingdom,0.64511,29500.39
United States,3.18523,173480.75


<b>Operations with SERIES</b> align the Series labels with the DataFrame's column names and broadcast each value down the rows (which can be counterintuitive). 

In [76]:
crisis = pd.Series([-1_000_000, -0.3], index=['GDP', 'HDI'])

In [77]:
df[['GDP', 'HDI']] + crisis

Unnamed: 0,GDP,HDI
Canada,785387.0,0.613
France,1833687.0,0.588
Germany,2874437.0,0.616
Italy,1167744.0,0.573
Japan,3602367.0,0.591
United Kingdom,1950039.0,0.607
United States,16348075.0,0.615
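
The sketch below shows both directions on a reduced copy of the data. The `per_row` Series and its values are hypothetical, added only to show how to line a Series up with the row labels instead of the column names.

```python
import pandas as pd

df = pd.DataFrame(
    {'GDP': [1785387, 2833687], 'HDI': [.913, .888]},
    index=['Canada', 'France'],
)

# Series labels match column names: each value is applied down its column.
per_column = pd.Series([-1_000_000, -0.3], index=['GDP', 'HDI'])
by_column = df + per_column

# Series labels match row labels: use an arithmetic method with axis=0.
per_row = pd.Series([1, 2], index=['Canada', 'France'])
by_row = df.add(per_row, axis=0)

print(by_column)
print(by_row)
```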


<h3> Modifying DataFrames </h3>
<p>Modifying a DataFrame is simple and intuitive. You can add columns or replace a column's values without issues.</p>
All of the drops shown earlier create new DataFrames. To make changes 'in place', pass the <b>inplace=True</b> argument (the default is False).

<b>Adding a new column:</b>

In [78]:
langs = pd.Series(
    ['French', 'German', 'Italian'],
    index=['France', 'Germany', 'Italy'],
    name='Language'
)

In [79]:
df['Language'] = langs

In [80]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Language
Canada,35.467,1785387,9984670,0.913,
France,63.951,2833687,640679,0.888,French
Germany,80.94,3874437,357114,0.916,German
Italy,60.665,2167744,301336,0.873,Italian
Japan,127.061,4602367,377930,0.891,
United Kingdom,64.511,2950039,242495,0.907,
United States,318.523,17348075,9525067,0.915,
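
The empty cells appear because column assignment matches the Series to the DataFrame by index label, not by position. A minimal sketch on a reduced copy of the data, with a deliberately reordered and incomplete Series:

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.467, 63.951, 80.94]},
    index=['Canada', 'France', 'Germany'],
)

# The Series is out of order on purpose and has no entry for Canada.
langs = pd.Series(['German', 'French'], index=['Germany', 'France'])
df['Language'] = langs

print(df)
```

France and Germany receive the right languages despite the ordering, and Canada is left as a missing value.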


<b>Replacing all values in a column:</b>

In [81]:
df['Language'] = 'English'

In [82]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Language
Canada,35.467,1785387,9984670,0.913,English
France,63.951,2833687,640679,0.888,English
Germany,80.94,3874437,357114,0.916,English
Italy,60.665,2167744,301336,0.873,English
Japan,127.061,4602367,377930,0.891,English
United Kingdom,64.511,2950039,242495,0.907,English
United States,318.523,17348075,9525067,0.915,English


<b>Renaming Columns</b>

In [83]:
df.rename(
    columns={ 'HDI': 'Human Development Index',
              'Annual Popcorn Consumption': 'APC' },  # no such column: ignored
    index={   'United States': 'USA',
              'United Kingdom': 'UK', 
              'Argentina': 'AR' } )                   # no such row: ignored

Unnamed: 0,Population,GDP,Surface Area,Human Development Index,Language
Canada,35.467,1785387,9984670,0.913,English
France,63.951,2833687,640679,0.888,English
Germany,80.94,3874437,357114,0.916,English
Italy,60.665,2167744,301336,0.873,English
Japan,127.061,4602367,377930,0.891,English
UK,64.511,2950039,242495,0.907,English
USA,318.523,17348075,9525067,0.915,English
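
Mapping entries that match nothing are skipped by default. If a typo in a label should not pass silently, rename accepts `errors='raise'`. A minimal sketch on a reduced copy of the data:

```python
import pandas as pd

df = pd.DataFrame(
    {'HDI': [.907, .915]},
    index=['United Kingdom', 'United States'],
)

# 'Argentina' is not in the index, so that mapping entry is skipped.
renamed = df.rename(index={'United States': 'USA', 'Argentina': 'AR'})
print(list(renamed.index))

# With errors='raise', an unknown label produces a KeyError instead.
try:
    df.rename(index={'Argentina': 'AR'}, errors='raise')
    raised = False
except KeyError:
    raised = True
print(raised)
```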


In [84]:
df.rename(index=str.upper)

Unnamed: 0,Population,GDP,Surface Area,HDI,Language
CANADA,35.467,1785387,9984670,0.913,English
FRANCE,63.951,2833687,640679,0.888,English
GERMANY,80.94,3874437,357114,0.916,English
ITALY,60.665,2167744,301336,0.873,English
JAPAN,127.061,4602367,377930,0.891,English
UNITED KINGDOM,64.511,2950039,242495,0.907,English
UNITED STATES,318.523,17348075,9525067,0.915,English


In [85]:
df.rename(index=lambda x: x.lower())

Unnamed: 0,Population,GDP,Surface Area,HDI,Language
canada,35.467,1785387,9984670,0.913,English
france,63.951,2833687,640679,0.888,English
germany,80.94,3874437,357114,0.916,English
italy,60.665,2167744,301336,0.873,English
japan,127.061,4602367,377930,0.891,English
united kingdom,64.511,2950039,242495,0.907,English
united states,318.523,17348075,9525067,0.915,English


<h4>Dropping Columns</h4>

In [86]:
df.drop(columns='Language', inplace=True)

In [87]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI
Canada,35.467,1785387,9984670,0.913
France,63.951,2833687,640679,0.888
Germany,80.94,3874437,357114,0.916
Italy,60.665,2167744,301336,0.873
Japan,127.061,4602367,377930,0.891
United Kingdom,64.511,2950039,242495,0.907
United States,318.523,17348075,9525067,0.915


<h4>Adding Values</h4>

In [88]:
df.append(pd.Series({
    'Population': 3,
    'GDP': 5},
    name='China'))

Unnamed: 0,Population,GDP,Surface Area,HDI
Canada,35.467,1785387.0,9984670.0,0.913
France,63.951,2833687.0,640679.0,0.888
Germany,80.94,3874437.0,357114.0,0.916
Italy,60.665,2167744.0,301336.0,0.873
Japan,127.061,4602367.0,377930.0,0.891
United Kingdom,64.511,2950039.0,242495.0,0.907
United States,318.523,17348075.0,9525067.0,0.915
China,3.0,5.0,,


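<b>NOTE:</b> 'append' returns a new DataFrame. DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions use <b>pd.concat</b> instead. The original is left untouched: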

In [90]:
df 
#Shows that the original DataFrame (df) was untouched.

Unnamed: 0,Population,GDP,Surface Area,HDI
Canada,35.467,1785387,9984670,0.913
France,63.951,2833687,640679,0.888
Germany,80.94,3874437,357114,0.916
Italy,60.665,2167744,301336,0.873
Japan,127.061,4602367,377930,0.891
United Kingdom,64.511,2950039,242495,0.907
United States,318.523,17348075,9525067,0.915
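
On pandas versions where DataFrame.append is not available, the same row can be added with pd.concat. This is a sketch on a reduced copy of the data, reusing the placeholder values from the append example; it is one possible replacement, not the only one.

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.467, 63.951], 'GDP': [1785387, 2833687], 'HDI': [.913, .888]},
    index=['Canada', 'France'],
)

# Same placeholder values as the append example.
new_row = pd.Series({'Population': 3, 'GDP': 5}, name='China')

# to_frame().T turns the Series into a one-row DataFrame labelled 'China'.
extended = pd.concat([df, new_row.to_frame().T])

print(extended)
print(df.shape)  # the original is still two rows
```

Like append, pd.concat returns a new DataFrame, and columns missing from the new row (here HDI) are filled with NaN.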


<h3> Creating Columns From Other Columns </h3>
<p>Altering a DataFrame often involves combining different columns into another. For our example, we could calculate the 'GDP per capita', which is just <b>GDP / Population</b></p>

In [91]:
df[['Population', 'GDP']]

Unnamed: 0,Population,GDP
Canada,35.467,1785387
France,63.951,2833687
Germany,80.94,3874437
Italy,60.665,2167744
Japan,127.061,4602367
United Kingdom,64.511,2950039
United States,318.523,17348075


First, divide one Series by the other:

In [92]:
df['GDP']/df['Population']

Canada            50339.385908
France            44310.284437
Germany           47868.013343
Italy             35733.025633
Japan             36221.712406
United Kingdom    45729.239975
United States     54464.120330
dtype: float64

Then assign the resulting Series to a new column of the original DataFrame: 

In [97]:
df['GDP Per Capita'] = df['GDP'] / df['Population']

In [100]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,GDP Per Capita
Canada,35.47,1785387,9984670,0.91,50339.39
France,63.95,2833687,640679,0.89,44310.28
Germany,80.94,3874437,357114,0.92,47868.01
Italy,60.66,2167744,301336,0.87,35733.03
Japan,127.06,4602367,377930,0.89,36221.71
United Kingdom,64.51,2950039,242495,0.91,45729.24
United States,318.52,17348075,9525067,0.92,54464.12
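
When the original DataFrame should stay unchanged, the `assign` method is an alternative that returns a new DataFrame with the extra column. A minimal sketch on a reduced copy of the data (the name `with_gdp_pc` is just for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.467, 63.951], 'GDP': [1785387, 2833687]},
    index=['Canada', 'France'],
)

# The column name contains spaces, so it is passed through a dict
# unpacked into keyword arguments.
with_gdp_pc = df.assign(**{'GDP Per Capita': df['GDP'] / df['Population']})

print(with_gdp_pc.round(2))
print(list(df.columns))  # df itself has no new column
```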


<h3>Statistical Info</h3>
<p>Other summary methods that give further insight into the data:</p>

In [108]:
df.head() #Snapshot of data

Unnamed: 0,Population,GDP,Surface Area,HDI,GDP Per Capita
Canada,35.47,1785387,9984670,0.91,50339.39
France,63.95,2833687,640679,0.89,44310.28
Germany,80.94,3874437,357114,0.92,47868.01
Italy,60.66,2167744,301336,0.87,35733.03
Japan,127.06,4602367,377930,0.89,36221.71


In [110]:
df.describe() # Summary statistics for each numeric column

Unnamed: 0,Population,GDP,Surface Area,HDI,GDP Per Capita
count,7.0,7.0,7.0,7.0,7.0
mean,107.3,5080248.0,3061327.29,0.9,44952.25
std,97.25,5494020.16,4576186.57,0.02,6954.98
min,35.47,1785387.0,242495.0,0.87,35733.03
25%,62.31,2500715.5,329225.0,0.89,40266.0
50%,64.51,2950039.0,377930.0,0.91,45729.24
75%,104.0,4238402.0,5082873.0,0.91,49103.7
max,318.52,17348075.0,9984670.0,0.92,54464.12


In [106]:
# Assigning population data to a variable for easier data referencing.
population = df['Population']

In [111]:
population.min(), population.max()

(35.467, 318.523)

In [112]:
population.sum()

751.118

In [113]:
population.mean()

107.30257142857144

In [114]:
population.std()

97.24996987121581

In [115]:
population.median()

64.511

In [116]:
population.describe()

count     7.00
mean    107.30
std      97.25
min      35.47
25%      62.31
50%      64.51
75%     104.00
max     318.52
Name: Population, dtype: float64

In [117]:
population.quantile(.25)

62.308

In [119]:
population.quantile([.2, .4, .6, .8, 1])

0.20    61.32
0.40    64.17
0.60    74.37
0.80   117.84
1.00   318.52
Name: Population, dtype: float64
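
To see where a value like `quantile(.25)` comes from, the sketch below reproduces it by hand, assuming the default linear interpolation between the two nearest sorted values. The helper names (`ordered`, `position`, `manual`) are illustrative.

```python
import pandas as pd

population = pd.Series(
    [35.467, 63.951, 80.94, 60.665, 127.061, 64.511, 318.523],
    name='Population',
)

ordered = population.sort_values().tolist()

# With n values, quantile q sits at fractional position q * (n - 1).
position = 0.25 * (len(ordered) - 1)   # 1.5 for seven values
lo = int(position)                     # index of the lower neighbour
frac = position - lo                   # how far towards the upper neighbour

manual = ordered[lo] + (ordered[lo + 1] - ordered[lo]) * frac

print(manual, population.quantile(.25))
```

Position 1.5 falls halfway between the second and third smallest values (60.665 and 63.951), which gives the 62.308 shown above.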