
### Review of Pandas, Basic Data Loading, and Pandas Plotting

#### Quick Review:

- Recall our fundamental objects: Series and DataFrames
- `apply` and element-wise operations
- Summarizing and computing descriptive statistics
- Python variable name binding

#### The Main Stuff:

- Loading data
- Grouping
- Plotting, etc.

In [None]:
#Import our basic libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#### Series

The two fundamental data structures in pandas are the *Series* and *DataFrame*

**Series** is a 1-D array-like object holding a sequence of values (types similar to NumPy types) plus an array of data labels called its *index*

In [None]:
#Make simple series and give the index some labels
obj = pd.Series([1, 3, 5, 7], index = ['a', 'b', 'c', 'd'])

obj

Can use either numeric index or labels in index to select values:

(Note possible ambiguity when labels are numeric)

In [None]:
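#Plain [] with an integer is deprecated for positional access on a labeled Series
#in recent pandas versions; iloc is the unambiguous positional form:
obj.iloc[1]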

In [None]:
#Note the right index is included with this format!
obj['a':'c']

In [None]:
#Note must use a list if getting multiple specific indices
obj[['a', 'b', 'd']]

#### Selection with loc and iloc

Select subsets of rows and columns using either axis labels (loc) or integer index (iloc)

- `loc`: Strictly label-based access
- `iloc`: Strictly integer-based access

Especially important when you have integer-valued labels

In [None]:
obj.iloc[0:2]

In [None]:
obj.loc['a':'c']

In [None]:
obj.loc[['a', 'c', 'd']]
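The distinction matters most when the labels are themselves integers. A minimal sketch, assuming a hypothetical Series `s` (not part of the notebook's data) whose integer labels are deliberately out of order:

```python
import pandas as pd

# Hypothetical Series: integer labels that do not match the positions
s = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = s.loc[0]      # the value stored under label 0
by_position = s.iloc[0]  # the value stored at position 0

print(by_label, by_position)
```

Here `loc` and `iloc` return different elements for the same key, which is why the explicit accessors are preferred over plain `[]`.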

#### Boolean Masking

Similar to NumPy, R, etc.

In [None]:
bool_obj = obj > 3
type(bool_obj)


In [None]:
obj.loc[bool_obj]

In [None]:
#Note boolean masking by a list of values:
obj.isin([5,7])

In [None]:
#And:
obj.loc[obj.isin([5,7])]

Can directly create a Series from data stored in a dictionary object:

In [None]:
#Area harvested for grain in 2020 in 1,000 acres, selected states; USDA Acreage Report
#https://usda.library.cornell.edu/concern/publications/j098zb09z?locale=en
####
sdata = {'Arizona': 29, 'Ohio': 3300, 'Texas': 1810, 'Oregon': 65, 'Iowa': 12900}

#Also set order, get NaNs if ask for anything not in the dictionary
states = ['Iowa', 'Arizona', 'Texas', 'Ohio', 'California']

obj = pd.Series(sdata, index = states)
obj
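To see which requested labels were missing from the dictionary, the NaN entries can be picked out with `isna()`. A small sketch reusing the values above; the name `missing_states` is only illustrative:

```python
import pandas as pd

sdata = {'Arizona': 29, 'Ohio': 3300, 'Texas': 1810, 'Oregon': 65, 'Iowa': 12900}
states = ['Iowa', 'Arizona', 'Texas', 'Ohio', 'California']

obj = pd.Series(sdata, index=states)

# Labels requested in `states` but absent from `sdata` come back as NaN
missing_states = obj[obj.isna()].index.tolist()

# Keys in `sdata` that were not requested (Oregon) are dropped entirely
print(missing_states, 'Oregon' in obj.index, obj.dtype)
```

Note that the integer data are promoted to a float dtype, since NaN is a floating-point value.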

### DataFrame

Similar to R, rectangular table of data, where each column can be a different data type.

- Both row and column index

- Can also think of as a dict of Series all sharing same index

In [None]:
#Let's make a DataFrame from a dict of equal-length lists:
#Index is assigned automatically

#Principal Crops Area Planted
#USDA Acreage Report
#https://usda.library.cornell.edu/concern/publications/j098zb09z?locale=en

data = {'state': ['Arizona', 'Arizona', 'Arizona', 'California', 'California', 'California', 'Iowa', 'Iowa', 'Iowa'],
        'year': [2019, 2020, 2021, 2019, 2020, 2021, 2019, 2020, 2021],
        'area planted': [637, 573, 616, 2983, 2621, 2550, 23935, 24330, 24330]}
        
df = pd.DataFrame(data)
df

**Recall selection with loc and iloc**

Select subsets of rows and columns using either axis labels (`loc`) or integer index (`iloc`)
- `loc`: Strictly label-based access
- `iloc`: Strictly integer-based access

In [None]:
#For example:
df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index = ['one', 'two', 'three', 'four'],
                  columns = ['C1', 'C2', 'C3', 'C4'])

df

In [None]:
#Basic loc:

#Note this (returns a Series):
df.loc['one']

#vs. this (returns a one-row DataFrame):
df.loc['one':'one']


In [None]:
df.loc['one', ['C1','C2']]

In [None]:
#iloc examples:
###

df.iloc[2] #Try :2, :3
#df.iloc[2,:]

#df.iloc[[0,2,3]]

#df.iloc[1:3, 2:4]

In [None]:
#Note again the asymmetry between iloc and loc for the end index:
###

df.loc['one':'three', 'C2':'C4']

df.iloc[1:3, 2:4]
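The end-point asymmetry is easiest to see by comparing shapes. A short sketch on the same 4x4 frame; the variable names `by_label` and `by_position` are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=['one', 'two', 'three', 'four'],
                  columns=['C1', 'C2', 'C3', 'C4'])

# Label slices include both end points
by_label = df.loc['one':'three', 'C2':'C4']

# Integer slices follow Python convention and exclude the stop position
by_position = df.iloc[1:3, 2:4]

print(by_label.shape, by_position.shape)
```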

#### Function Application and Mapping

NumPy ufuncs also work on pandas objects:

In [None]:
df = pd.DataFrame(np.random.randn(4,3).round(2),
                  columns = list('ABC'),
                  index = list('abcd'))

display(df)

display(np.sqrt(df))

`apply` method: Applies a function across columns or rows, similar to apply in R:

In [None]:
df = pd.DataFrame(np.arange(12).reshape(4,3),
                  columns = list('ABC'),
                  index = list('abcd'))

display(df)

#By default, applies down the columns, can change axis:
df.apply(np.sum, axis='columns')

In [None]:
#Can use our own functions. Recall lambda keyword:
f = lambda x: x.max() - x.min()

df.apply(f)

In [None]:
#Also similar to R, can return more than a single scalar:
#Can also return a Series with multiple values:
f = lambda x: pd.Series([x.min(), x.max()], index=['min', 'max'])

df.apply(f)

In [None]:
#Note: apply not usually necessary for common array statistical functions:
######

np.mean(df, axis='index')

To use an element-wise Python function with a DataFrame, use the `applymap` method (renamed to `DataFrame.map` in pandas 2.1, where `applymap` is deprecated):

In [None]:
#Converts the input to a string, add 'S'
f = lambda x: str(x) + 'S'

df.applymap(f)
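Since the element-wise method has different names across pandas versions, one possible way to write version-tolerant code is to pick whichever method exists. This is only a sketch; the `elementwise` name is an assumption for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  columns=list('ABC'),
                  index=list('abcd'))

f = lambda x: str(x) + 'S'

# Prefer DataFrame.map (pandas >= 2.1); fall back to applymap on older versions
elementwise = df.map if hasattr(df, 'map') else df.applymap

result = elementwise(f)
print(result)
```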

#### More Sorting and Ranking

We can sort by row or column index, using the `sort_index` method:

In [None]:
df = pd.DataFrame([[4,1,2], [1,9,0], [0,5,2], [9,5,1]],
                  columns = list('BAC'),
                  index = list('bcad'))

#Sort by the row index (the default, axis=0):
df.sort_index()

In [None]:
df.sort_index(axis=1)

In [None]:
df.sort_index(axis=1, ascending=False)

In [None]:
#And again, to sort by values:
#df.sort_values(by = ['A'])

df.sort_values(by = ['A','C'])

In [None]:
#Can also rank items in a DataFrame
####

df.rank()
#df.rank(method="first")

#df.rank(axis="columns", method="first")

#method options: average, min, max, first, dense

#### Summarizing and Computing Descriptive Statistics

- pandas objects have set of common mathematical and statistical methods
- Usually reductions or summary statistics: yield a single value for a Series, or a Series of values from the rows or columns of a DataFrame
- Built-in handling for missing data

In [None]:
#Simple example:
df = pd.DataFrame([[1, np.nan], [2, 3], [4, np.nan], [5,6]],
                  index = list('abcd'),
                  columns = ['one', 'two'])
df

In [None]:
#Sum down the columns, ignoring NaNs:
df.sum()

In [None]:
#Sum across the columns, i.e. by index:
df.sum(axis='columns') #Or axis = 1

In [None]:
#Can do skipna = False:
df.sum(axis=1, skipna=False)

In [None]:
#idxmin and idxmax return index labels where min and max values attained:
#argmin and argmax (Series only) return index locations (integers) where min and max attained

#display(df.max().max())

df.idxmax()

In [None]:
#Can do cumulative sums and products:
df.cumsum()
#df.cumprod()

In [None]:
#And a bunch of summary statistics:
df.describe()

Methods:
- `count`
- `describe`
- `min`, `max`
- `argmin`, `argmax` (Series only)
- `idxmin`, `idxmax`
- `quantile`
- `sum`
- `mean`
- `median`
- `mad` (mean absolute deviation from mean; removed in pandas 2.0)
- `prod`
- `var`
- `std`
- `skew`
- `kurt`
- `cumsum`
- `cummin`, `cummax`
- `cumprod`
- `diff` (first arithmetic difference)
- `pct_change`
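Where `mad` is unavailable, the mean absolute deviation can be assembled from the other reductions. A minimal sketch on the small NaN-containing frame from above; `mad_manual` is an illustrative name:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan], [2, 3], [4, np.nan], [5, 6]],
                  index=list('abcd'),
                  columns=['one', 'two'])

# Subtract each column's mean, take absolute values, then average;
# NaNs are skipped by mean() by default
mad_manual = (df - df.mean()).abs().mean()

print(mad_manual)
```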

In [None]:
df.diff()

#### Finally: Variable name binding

In [None]:
#Remake our old df:
####

data = {'state': ['Arizona', 'Arizona', 'Arizona', 'California', 'California', 'California', 'Iowa', 'Iowa', 'Iowa'],
        'year': [2019, 2020, 2021, 2019, 2020, 2021, 2019, 2020, 2021],
        'area planted': [637, 573, 616, 2983, 2621, 2550, 23935, 24330, 24330]}
        
df = pd.DataFrame(data)
df

In [None]:
df2 = df

df2 is df

In [None]:
df2.loc[df2['state'] == 'Arizona', 'year'] = [1,2,3]
df2
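Because `df2 = df` only binds a second name to the same object, the assignment above also changes `df`. If an independent object is wanted, `copy()` is the usual route. A small sketch with a hypothetical two-row frame:

```python
import pandas as pd

df = pd.DataFrame({'state': ['Arizona', 'Iowa'], 'year': [2019, 2020]})

alias = df               # second name for the same object
independent = df.copy()  # separate object with its own data

alias.loc[0, 'year'] = 1

# The change shows up through `df`, but not in the copy
print(df.loc[0, 'year'], independent.loc[0, 'year'])
```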

In [None]:
#Also Note:
#To get cells by column values:
#display(df2.loc[(df2['state'] == 'California')])

display(df2.loc[(df2['state'] == 'California') & (df2['year'] == 2019)])

In [None]:
#Or just:
df2[(df2['state'] == 'Arizona') & (df2['year'] == 1)]

### Basic Data Loading

In [None]:
#Let's load some data on carbon emissions by nation
#######
url_name = r'https://zenodo.org/record/4281271/files/nation.1751_2017.csv?download=1'

df = pd.read_csv(url_name)

In [None]:
#Let's see what we got:
df.head()

Obviously there are some issues... Let's download the .csv file and take a look at it in Excel:

In [None]:
from urllib import request

#URL for file
url_name = r'https://zenodo.org/record/4281271/files/nation.1751_2017.csv?download=1'

#Local file name to save to
local_file = r'Data/nation.1751_2017.csv'

#Download and save
request.urlretrieve(url_name, local_file)

In [None]:
#Looks like we have some front matter taking up 4 lines, so...
#Can either use header = 4
#OR
#skiprows = 4

#skiprows can also be a list, not just a number!

#Note: MUCH faster to read the local file, vs. the url
df = pd.read_csv(local_file, header = 4)

In [None]:
#Now:
df.head()
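The equivalence of `header = 4` and `skiprows = 4` can be checked without the real file. The sketch below uses a made-up in-memory CSV (the front-matter lines and the rows are hypothetical, not the actual emissions data):

```python
import io
import pandas as pd

# Hypothetical file: 4 lines of front matter, then a header row, then data
raw = "\n".join([
    "Title line",
    "Source line",
    "Units line",
    "Notes line",
    "Nation,Year,Total",
    "A,2000,1",
    "B,2001,2",
])

# header=4: the row at (0-based) line 4 holds the column names
via_header = pd.read_csv(io.StringIO(raw), header=4)

# skiprows=4: skip the first 4 lines, then the next line is the header
via_skiprows = pd.read_csv(io.StringIO(raw), skiprows=4)

print(via_header.equals(via_skiprows))
```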

In [None]:
#We can set different columns names on reading like so:
#######

name_list = ['Nation', 'Year', 'Total', 'Solid', 'Liquid', 'Gas', 'Cement', 'Flaring', 'Per_Capita', 'Bunker']

#Use header = 4, or somewhat more verbose:
df = pd.read_csv(local_file, skiprows = 5, header = None, names = name_list)

In [None]:
df

In [None]:
#Let's see what all the unique nations are:
################

display(df.Nation.unique())

In [None]:
#And how many are there?

display(len(df.Nation.unique()))

In [None]:
#And let's look at the US: Note the trailing white space
df.loc[df['Nation'] == 'UNITED STATES OF AMERICA '].head()

In [None]:
#We can strip the leading and trailing whitespace from our Nation names:
#Use lstrip() and rstrip() for just leading/following:

df.Nation = df.Nation.str.strip()

df.loc[df['Nation'] == 'UNITED STATES OF AMERICA'].head()

Note that it looks like we have a lot of missing data...
And note our data types:

In [None]:
#Get our data types:
####
df.dtypes

In [None]:
df.head()

In [None]:
#There are a few ways to deal with this...

#First note what these values are:
display(df.iloc[0,8])

#And note that there is a space!

In [None]:
#One option is to reload:
#####

name_list = ['Nation', 'Year', 'Total', 'Solid', 'Liquid', 'Gas', 'Cement', 'Flaring', 'Per_Capita', 'Bunker']

df = pd.read_csv(local_file, header = 4, names = name_list, na_values = '. ')

df.head()

In [None]:
#Now our types?

df.dtypes
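The effect of `na_values` on the inferred types can be reproduced on a tiny in-memory example. This is a sketch with hypothetical rows that mimic the `'. '` placeholder (a period followed by a space):

```python
import io
import pandas as pd

# Hypothetical data: the first Flaring entry is the '. ' placeholder
raw = "Nation,Year,Flaring\nA,2000,. \nB,2001,5\n"

# Without na_values the placeholder forces the column to a non-numeric type
as_text = pd.read_csv(io.StringIO(raw))

# With na_values the placeholder becomes NaN and the column is parsed as float
with_nan = pd.read_csv(io.StringIO(raw), na_values='. ')

print(as_text['Flaring'].dtype, with_nan['Flaring'].dtype)
```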

In [None]:
#We can also specify types when loading:
#Use a dictionary:
#######

df = pd.read_csv(local_file, header = 4, names = name_list, na_values = '. ',
                 dtype = {'Total': np.float64, 'Cement': np.float64, 'Bunker': np.float64})

df.dtypes

In [None]:
#Could also force the columns to a numeric type AFTER loading
###

df = pd.read_csv(local_file, header = 4, names = name_list)

#This will fail: Can't convert a string
df.Flaring.astype(np.float64)

In [None]:
#This will work: Use pd.to_numeric function
#Set errors = coerce so that if invalid parsing, give a NaN
df.Flaring = pd.to_numeric(df.Flaring, errors = 'coerce')

df.head()

In [None]:
#To convert all the columns we want, we can use apply:
##########
convert_cols = ['Total', 'Solid', 'Liquid', 'Gas', 'Cement', 'Flaring', 'Per_Capita', 'Bunker']

df[convert_cols] = df[convert_cols].apply(pd.to_numeric, errors='coerce')

df.head()
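The same column-wise conversion on a small hypothetical frame, to show what `errors='coerce'` does to values that cannot be parsed:

```python
import pandas as pd

# Hypothetical frame with numbers stored as strings and '. ' placeholders
df = pd.DataFrame({'Nation': ['A', 'B'],
                   'Total': ['10', '. '],
                   'Gas': ['. ', '3']})

convert_cols = ['Total', 'Gas']

# apply() calls pd.to_numeric on each column; unparseable entries become NaN
df[convert_cols] = df[convert_cols].apply(pd.to_numeric, errors='coerce')

print(df)
```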

#### Read only specific columns:

In [None]:
#Note: If all we wanted was Nation, Year, Total, and Per_Capita:
#Add usecols: Note we can use the new names

df = pd.read_csv(local_file, skiprows = 5, header = None, names = name_list,
                 usecols = ['Nation', 'Year', 'Total', 'Per_Capita'], na_values = '. ')

#Could also use column integer index:
#df = pd.read_csv(local_file, skiprows = 5, header = None, names = name_list,
#                 usecols = [0, 1, 2, 8], na_values = '. ')

In [None]:
df.head()

#### Filter the missing data

In [None]:
#Once again, reload:
#Let's just define a function at this point...

def get_carbon_df():
    local_file = r'Data/nation.1751_2017.csv'
    
    name_list = ['Nation', 'Year', 'Total', 'Solid', 'Liquid', 'Gas', 'Cement', 'Flaring', 'Per_Capita', 'Bunker']
    new_df = pd.read_csv(local_file, header = 4, names = name_list, na_values = '. ')
    
    #Throw in stripping the whitespace from the names too:
    new_df.Nation = new_df.Nation.str.strip()
    
    return new_df

In [None]:
#Once again, reload:
df = get_carbon_df()

#We managed to get NaNs where the data was missing, can we filter out more?
#Yes, but first:

#This snippet gets the rows where anything is null:
display(df[df.isnull().any(axis=1)]);

#How many entries have a NaN?
len(df[df.isnull().any(axis=1)])


Our main NA handling methods:

- `dropna()`
- `fillna()`
- `isnull()`
- `notnull()`

In [None]:
#We can either drop any rows with NAs:
df2 = df.dropna()

print(len(df))

print(len(df2))

In [None]:
#Or fill them with something, like 0, if that seems appropriate:
df.fillna(0)

In [None]:
#We can also fill different columns with different values:
#Use a dictionary...

#First, get a dataframe with lots of NaNs to better demo:
df_bad = df[df.isnull().any(axis=1)]

#Sample 20 of the finest rows!
df_bad.sample(20)

In [None]:
#Note that we can get false positive SettingWithCopyWarnings
df_bad.loc[:,'Flaring'] = 99
df_bad

In [None]:
#Now fill NaNs in only very select columns using a dictionary

df_bad.fillna({'Per_Capita': 999999, 'Flaring':-888}).sample(20)

#### Back to dropping...

In [None]:
#We can also restrict ourselves to only dropping rows that are *all* NaN:
########

df2 = df.dropna(how = 'all')

print(len(df))

print(len(df2))

In [None]:
#Or if specific columns have NaNs:
########

df2 = df.dropna(subset = ['Solid', 'Flaring'])

print(len(df2))

#### Drop columns with NaNs?

In [None]:
#Set our axis to 1 and we will drop columns that contain any NaNs:
df2 = df.dropna(axis = 1) #, how = 'all') #, subset=[0]

df2.head()
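The `dropna` variants are easier to compare side by side on a frame small enough to check by eye. A sketch with hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 complete, row 1 partly missing, row 2 entirely missing
df = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                   'b': [1.0, 2.0, np.nan],
                   'c': [1.0, 2.0, np.nan]})

any_rows = df.dropna()                # drop rows with at least one NaN
all_rows = df.dropna(how='all')       # drop only rows that are entirely NaN
subset_rows = df.dropna(subset=['b']) # only look at column b

# After removing the all-NaN row, drop the columns that still contain a NaN
remaining_cols = all_rows.dropna(axis=1).columns.tolist()

print(len(any_rows), len(all_rows), len(subset_rows), remaining_cols)
```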

### Using GroupBy with Pandas

1. *Split* DataFrame into groups based on one or more *keys*, along a particular axis (rows, `axis = 0`; or columns, `axis = 1`)
2. Apply a function to each group, producing a new value
3. Combine results into a new object

<img src="split_apply_flow.jpg" alt="drawing" style="width:550px;"/>

Let's demo with something simpler, and then return to our carbon emissions dataframe

In [None]:
#Make a simple dataframe
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [20, 30, 40, 50, 60],
                   'data2': [10, 11, 12, 13, 14]})

df

In [None]:
#Let's say we want the mean of our data grouped by key1:
df_grouped = df.groupby('key1')

#Now we have a GroupBy object:
df_grouped

In [None]:
#Count how many are in each group:
df_grouped.size()

#Or:
#df_grouped.count()

In [None]:
#Do the means:
##############

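df_grouped.mean(numeric_only=True)

#type(df_grouped.mean(numeric_only=True))

Note that the `key2` column is excluded above, as it is non-numeric and a mean would not make sense. Older pandas versions dropped such columns silently; pandas 2.0 and later raise a `TypeError` unless `numeric_only=True` is passed.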
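To make the split-apply-combine steps concrete, the grouped mean can be reproduced by hand with a loop. This is only an illustrative sketch of what `groupby` does conceptually, not how pandas implements it; the `manual` name is an assumption:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [20, 30, 40, 50, 60],
                   'data2': [10, 11, 12, 13, 14]})

# Split: one boolean mask per key; Apply: mean of data1; Combine: into a Series
manual = {}
for key in df['key1'].unique():
    manual[key] = df.loc[df['key1'] == key, 'data1'].mean()
manual = pd.Series(manual)

# The same result in one line with groupby
grouped = df.groupby('key1')['data1'].mean()

print(manual, grouped, sep='\n')
```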

In [None]:
#key1 is now our index. To reset to a column:
#############

df_grouped.mean(numeric_only=True).reset_index()

In [None]:
#Note that the above is a DataFrameGroupBy object
#Can also get a SeriesGroupBy object:

grouped = df['data1'].groupby(df['key1'])

grouped

In [None]:
#And get mean: (Or sum, etc.)
grouped.mean()


We can group by multiple keys:

In [None]:
df_grouped = df.groupby(by = ['key1', 'key2'])

#We get mean by two keys, and note the object that is created is a DataFrame
df_means = df_grouped.mean()
df_means

#type(df_means)

In [None]:
#The above Dataframe has a hierarchical index, consisting of unique pairs of the keys:
df_means.index


In [None]:
#And we can turn this hierarchical index into columns by resetting the index:
df_means.reset_index()

In [None]:
#We can also get the sizes of our various groups:
df.groupby(by = ['key1', 'key2']).size()

In [None]:
#OR:
df.groupby(by = ['key1', 'key2']).count()

#### We can iterate over groupby objects:

In [None]:
#For example:
for name, group in df.groupby(by = 'key1'):
    print(name, '\n')
    print(group, '\n')
    
    #group now is a DataFrame we could use however we like
    print(type(group))

Lots more cool grouping that can be done, consult the docs for more!

#### Can also apply your own aggregation functions to groupby objects: Use `agg()` method

In [None]:
peak_to_trough = lambda x: x.max() - x.min()

In [None]:
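#Again, simple grouping:
df_grouped = df.groupby('key1')

#Select the numeric columns: max() - min() is not defined for the strings in key2
df_grouped[['data1', 'data2']].agg(peak_to_trough)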
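`agg()` also accepts a list, so built-in reductions (named by string) and custom functions can be mixed in one call. A sketch on the same toy frame; here `peak_to_trough` is written as a named function so that the result column gets a readable label:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [20, 30, 40, 50, 60],
                   'data2': [10, 11, 12, 13, 14]})

def peak_to_trough(x):
    # Range of the values within one group
    return x.max() - x.min()

# One column per aggregation; the function name becomes the column label
summary = df.groupby('key1')['data1'].agg(['min', 'max', peak_to_trough])

print(summary)
```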

### Back to emissions data

In [None]:
#Grab the data again, and remind ourselves
df = get_carbon_df()

df.head()

In [None]:
#We can group by nation, get sum of emissions
df_grouped = df.groupby(by = 'Nation')

df_grouped

In [None]:
#Get a DataFrame of just the sums:
df_sums = df_grouped.sum()

#Note that Year and Per_Capita are now meaningless

#If we just wanted Total:
#df_sums = df_grouped[['Total']].sum()

display(df_sums.head())

#Plot the Total cumulative emissions histogram:
df_sums['Total'].hist(bins = 50, edgecolor='black', facecolor=(.5, .1, .1), grid=False, log=True, figsize=(8,6));

In [None]:
#How much fossil (+cement) carbon has been emitted by all countries since 1751?
#In Million Metric Tons
#Note that the following is a Series object:
df_sums.sum() / 1e6


In [None]:
#Get the top cumulative emitters:
#Divide by 1e6 to convert to million metric tons
######

df_sums.sort_values(by = ['Total'], ascending=False).head(20) / 1e6

#Get number one:
#df_sums.sort_values(by = ['Total'], ascending=False).head(20).index[0]

In [None]:
#Can also plot all the variables:
###
#Let's add a bigger figure/axis:
fig1, ax1 = plt.subplots(1, 1, figsize=(12,8))

#Exclude Year:
df_sums.iloc[:,1:9].hist(bins = 30, ax = ax1, edgecolor='black', facecolor=(.5, .1, .1), grid=False, log=True);

In [None]:
#To put Nation name, which is now the index, back to a column:
#####

df_sums.reset_index().head()

### Now let's get cumulative emissions and plot...

In [None]:
df = get_carbon_df()

In [None]:
#The following bit of code with groupby Nation will take cumulative sum down the years:
#Could do this:
#df = df.fillna(0)

df_cumulative = df.copy()

#Drop Per_Capita: Doesn't make sense to cumulative sum
df_cumulative.drop(columns = 'Per_Capita', inplace=True)


convert_cols = ['Total', 'Solid', 'Liquid', 'Gas', 'Cement', 'Flaring', 'Bunker']

#Note we preserve the Year, doing this
#Note the division by 1e6
df_cumulative[convert_cols] = df.groupby(by = ['Nation'])[convert_cols].cumsum() / 1e6

display(df_cumulative.head())

#Plot the results for the US:
df_cumulative.loc[df_cumulative['Nation'] == 'UNITED STATES OF AMERICA'].plot(x = 'Year')
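A grouped `cumsum()` returns a result aligned to the original rows, restarting the running total within each group, which is why it can be assigned straight back into the frame. A minimal sketch with hypothetical nations `A` and `B`:

```python
import pandas as pd

# Hypothetical data with the two nations interleaved
df = pd.DataFrame({'Nation': ['A', 'B', 'A', 'B'],
                   'Year': [2000, 2000, 2001, 2001],
                   'Total': [1, 10, 2, 20]})

# Running total per nation, returned in the original row order
df['Cumulative'] = df.groupby('Nation')['Total'].cumsum()

print(df)
```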

In [None]:
#The following is an alternative method to take the cumulative sums (preserving year):
######

#Drop Per_Capita: Doesn't make sense to cumulative sum
#Then group by Nation and Year, sum, and take the cumulative sum within each Nation
df_cumulative = df.drop(columns = 'Per_Capita').groupby(by = ['Nation', 'Year'])
df_cumulative = df_cumulative.sum().groupby(level=0).cumsum().reset_index()

#####

In [None]:
#We could use grouping to plot individual country time-series:
####

fig1, ax1 = plt.subplots(1, 1, figsize=(14,8))

#Sneak in an enumerate:
####
for i, (name, group) in enumerate(df_cumulative.groupby(by = 'Nation')):
    #print(name, '\n')
    #print(group, '\n')
    
    group.plot(x = 'Year', y = 'Total', ax = ax1, legend=False, logy=True, label=name)
    
    if (i > 25):
        break
        
ax1.legend(ncol=2, loc='upper left', fontsize=9)

#### Get Total Cumulative Emissions by Year

In [None]:
#Let's group by year and sum, to get total global emissions!:
########

#numeric_only=True keeps the Nation strings out of the sum
df_world = df_cumulative.groupby(by = 'Year').sum(numeric_only=True)

df_world = df_world.reset_index()

#We have some very odd points where the cumulative sums go down:
df_world.plot(x = 'Year', figsize=(8,6))


In [None]:
#Let's compare the above to the provided global emissions data series...
####

url_name = 'https://zenodo.org/record/4281271/files/global.1751_2017.csv?download=1'

#Let's go ahead and download...
global_file = r'Data/global.1751_2017.csv'

#Download and save
request.urlretrieve(url_name, global_file)


In [None]:
#No Nation or Bunker columns for this dataset:
name_list = ['Year', 'Total', 'Solid', 'Liquid', 'Gas', 'Cement', 'Flaring', 'Per_Capita']

df_global = pd.read_csv(global_file, header = 4, names = name_list)
df_global.head()

#This is the global yearly emissions series:
df_global.plot(x = 'Year', figsize=(8,6))

In [None]:
#What do we have as our grand totals?
df_global.sum() / 1e3

In [None]:
#Compare to our previous:
df_sums.sum() / 1e6

In [None]:
#Let's take the cumulative sums and look at our time-series...
#####
df_global_cum = df_global.copy()

df_global_cum.iloc[:,1:8] = df_global.iloc[:,1:8].cumsum()



#Plot the new data and the old, just the Totals
fig1, ax1 = plt.subplots(1, 1, figsize=(8,6))

df_global_cum[['Total', 'Year']].plot(x = 'Year', ax = ax1)

#Need to divide by 1000: Different unit scalings for the datasets
ax1.plot(df_world.Year, df_world.Total / 1000, label='Summed Total')
ax1.legend()

In [None]:
#And look at our dataframe
df_global_cum

In [None]:
#Turns out, there are negative values in the National series, e.g.:
####

df.loc[df['Nation'] == 'ISLAMIC REPUBLIC OF IRAN'].plot(x = 'Year')


In [None]:
#Let's see all the values < 0:
df.loc[df['Total'] < 0]

In [None]:
#Let's filter any negative values in the national series to 0
#Unclear why this happened, but oh well let's try...

df = get_carbon_df()

#Iterate through everything but Names and Year:
#Also track total negative values for the Total column
neg_sum = 0

for k in df.columns[2:]:
    if (k == 'Total'):
        neg_sum = neg_sum + df.loc[df[k] < 0, k].sum()
    
    df.loc[df[k] < 0, k] = 0
    
neg_sum

In [None]:
#Confirm filtering!

df.loc[df['Total'] < 0]

In [None]:
#Now plot again:

df.loc[df['Nation'] == 'ISLAMIC REPUBLIC OF IRAN'].plot(x = 'Year')

In [None]:
#Now let's repeat the steps above...

#We have our shiny new, filtered df:
df_cumulative = df.copy()


#Drop Per_Capita: Doesn't make sense to cumulative sum
df_cumulative.drop(columns = 'Per_Capita', inplace=True)

convert_cols = ['Total', 'Solid', 'Liquid', 'Gas', 'Cement', 'Flaring', 'Bunker']

df_cumulative[convert_cols] = df.groupby(by = ['Nation'])[convert_cols].cumsum()

#Let's group by year and sum, to get total global emissions!:
########

#numeric_only=True keeps the Nation strings out of the sum
df_world = df_cumulative.groupby(by = 'Year').sum(numeric_only=True)


#We *still* have some very odd points where the cumulative sums go down:
df_world.plot()



In [None]:
#First shifts happen around 1947
display(df_world.loc[1947] - df_world.loc[1946])

In [None]:
#Let's see how many countries are contributing at each year:

num_countries = df_cumulative.groupby('Year').size()

num_countries.plot()

#How many countries around from 1940 through 1960?
display(num_countries[(num_countries.index < 1960) & (num_countries.index > 1940)])

#How many countries around from 1975 through 1995?
display(num_countries[(num_countries.index > 1975) & (num_countries.index < 1995)])

#Turns out different countries have data for different years, so we get the jumps...
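Why changing coverage produces a dip can be shown with two hypothetical nations, one of which stops reporting a year early. The numbers below are made up purely for illustration:

```python
import pandas as pd

# Hypothetical data: nation B has no record for 1947
df = pd.DataFrame({'Nation': ['A', 'A', 'A', 'B', 'B'],
                   'Year': [1945, 1946, 1947, 1945, 1946],
                   'Total': [5, 5, 5, 10, 10]})

# Cumulative emissions within each nation
df['Cumulative'] = df.groupby('Nation')['Total'].cumsum()

# Summing the per-nation cumulative series by year loses B's total in 1947
world = df.groupby('Year')['Cumulative'].sum()

print(world)
```

The summed series falls in the last year even though no individual nation's cumulative total decreased; the nation that drops out simply stops contributing its accumulated total.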

In [None]:
#The one that explains dip at 1947: Germany!
####

for k, (name, group) in enumerate(df_cumulative.groupby(by = 'Nation')):
    
    if 1946 in group.Year.values and not (1947 in group.Year.values):
        print(name)

    #What about:
    #And flip these also:
    #if 1992 in group.Year.values and not (1989 in group.Year.values):
    #    print(name)
        

In [None]:
df_cumulative.loc[df_cumulative['Nation'] == 'GERMANY'].plot(x = 'Year')

In [None]:
#And what were Germany's emissions in 1946:
df_cumulative.loc[(df_cumulative['Nation'] == 'GERMANY') & (df_cumulative['Year'] == 1946)]

In [None]:
#And difference 1946 to 1947 calculated global cumulative?
df_world.loc[1947] - df_world.loc[1946]

In [None]:
#Dip in the early 1990s?
#Clearly related to the breakup of the former Soviet Union and the reshuffling of countries

### Plotting With Pandas


In [None]:
#We've already plotted some time series
#Let's just use our global time series here

#We can specify an axis as above:
fig1, ax1 = plt.subplots(1,1, figsize=(8,6))


#Recall plotting one series in the dataframe:
df_global.plot(x = 'Year', y = 'Total', ax=ax1)


In [None]:
#And plotting several:
fig1, ax1 = plt.subplots(1,1, figsize=(8,6))

#df_global.plot(x = 'Year', y = ['Solid', 'Liquid', 'Gas'], ax=ax1)

#For all:
df_global.plot(x = 'Year', ax=ax1)

In [None]:
#Can also do cumulative sum like so
#Set Year as the index first so that it is not cumulatively summed too:
fig1, ax1 = plt.subplots(1,1, figsize=(8,6))

df_global.set_index('Year').cumsum().plot(y = ['Solid', 'Liquid', 'Gas'], ax=ax1)

Can specify `kind` argument to `plot()` method, with options:

- `bar` or `barh` for bar plots

- `hist` for histogram

- `box` for boxplot

- `kde` or `density` for density plots

- `area` for area plots

- `scatter` for scatter plots

- `hexbin` for hexagonal bin plots

- `pie` for pie plots


In [None]:
#Let's do an area plot:
#######

fig1, ax1 = plt.subplots(1,1, figsize=(8,6))

df_global.plot(x = 'Year', kind = 'area', ax = ax1)

In [None]:
#But we don't really want Total or Per_Capita
#Let's exclude:
df2 = df_global.loc[:, df_global.columns.difference(['Total', 'Per_Capita'])]


fig1, ax1 = plt.subplots(1,1, figsize=(8,6), dpi=90)

#Plot: Can change the colormap to various options:
df2.plot(x = 'Year', kind = 'area', ax = ax1, cmap='Reds')

In [None]:
df2

In [None]:
#And cumsum:
fig1, ax1 = plt.subplots(1,1, figsize=(8,6), dpi=90)

#Let's preserve year here:
df2.iloc[:,:-1] = df2.iloc[:,:-1].cumsum()

#And plot:
df2.plot(x = 'Year', kind = 'area', ax = ax1, cmap='Set2')

In [None]:
#Note pd.to_datetime:
#Also note that we need to convert to a string first:
#Passing format='%Y' makes explicit that the strings are four-digit years:
df2.Year = pd.to_datetime(df2.Year.astype(str), format='%Y')

#And plot:
fig1, ax1 = plt.subplots(1,1,figsize=(8,6), dpi=90)
df2.plot(x = 'Year', kind = 'area', ax = ax1, cmap='Set2')

### Finally, melt + a pie chart...

In [None]:
df2.iloc[[-1]]

In [None]:
#Let's get a DataFrame that is just our last year
final_emissions = df2.iloc[[-1]]

final_emissions

We'd like to make a pie chart showing the relative contributions of each fossil type, but pie() takes a single column as the `y` argument

Let's `melt`:

In [None]:
df_long = pd.melt(final_emissions, id_vars=['Year'], value_vars=final_emissions.columns[0:5],
        var_name='Category', value_name='Emissions')

df_long
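What `melt` does is easier to see on a tiny wide frame. A sketch with hypothetical values (not the real final-year emissions):

```python
import pandas as pd

# Hypothetical one-row wide frame: one column per category
wide = pd.DataFrame({'Year': [2017], 'Gas': [1.0], 'Solid': [2.0]})

# Each value column becomes a row: (Year, Category, Emissions)
long = pd.melt(wide, id_vars=['Year'], value_vars=['Gas', 'Solid'],
               var_name='Category', value_name='Emissions')

print(long)
```

The category names end up in a single column, leaving one numeric column that a pie chart can use as its `y` argument.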

In [None]:
#Make our much anticipated pie chart:
df_long.plot.pie(y = 'Emissions', figsize=(6,6))

#Gah! Labels are all wrong

In [None]:
#To fix, let's set our index:
df_long = df_long.set_index('Category')

df_long

In [None]:
#Plot our even more anticipated pie chart:
fig1, ax1 = plt.subplots(1,1, figsize=(6,6))

df_long.plot.pie(y = 'Emissions', figsize=(6,6), fontsize=14, ax=ax1)

#Try with and without this:
ax1.legend(fontsize=14, loc='upper right', bbox_to_anchor=(1.4, 1))
ax1.set_ylabel('Emissions', fontsize=14)


### Some simpler plotting demos from here out...

In [None]:
#Just make some random stuff:
df = pd.DataFrame(np.random.rand(10, 4).round(2)*100, columns=list("ABCD"))

#Note that we can mix in other matplotlib plotting on the same axis:
####

fig1, ax1 = plt.subplots(1,1, figsize=(8,6))
df.plot(ax = ax1, linewidth=3, cmap='Reds')

#Add some matplotlib
x = np.arange(0,10)
y = x*10
ax1.plot(x, y, linewidth=5, color='blue')

In [None]:
df

#### Can also plot using the accessor methods df.plot.\<kind\>

In [None]:
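#Can try tab completion in the notebook:
#df.plot.<TAB>
#Or list the available plot methods directly:
[m for m in dir(df.plot) if not m.startswith('_')]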

In [None]:
#Do a simple bar plot:
df.iloc[1].plot.bar()

In [None]:
#Do a slightly less simple:
df.plot.bar()

In [None]:
#To stack, or do horizontal:
fig1, ax1 = plt.subplots(1,2, figsize=(14,6))

df.plot.bar(ax = ax1[0], stacked=True)
df.plot.barh(ax = ax1[1], stacked=True)

In [None]:
#Can make a boxplot:
df.plot.box()

In [None]:
#Or:
#We can also plot one or more columns, grouped by another:

#Remake df:
df = pd.DataFrame(np.random.rand(10, 4), columns=list("ABCD"))

#Set column C = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2], but more compactly:
df["C"] = np.array([1]*5 + [2]*5)

fig1, ax1 = plt.subplots(1,1, figsize=(14,6))

df.boxplot(column = ["A"], by = "C", ax = ax1) #, grid=False)

#ax1.grid(False)
#fig1.suptitle('Replace the Default');
#ax1.set_title('Replace this default too')

In [None]:
#Can do custom colors, positions, and widths, etc:
####

my_colors = {
    "boxes": "DarkGreen",
    "whiskers": "DarkOrange",
    "medians": "DarkBlue",
    "caps": "Gray",
}

my_positions = [1, 4, 5, 7]
my_widths = [.5, 1.5, .75, .2]

df.plot.box(color = my_colors, sym='*', vert=False,
             positions=my_positions,
             widths=my_widths)

In [None]:
#And can do custom box, whisker, cap, flier, and median props:
whiskerprops = dict(linestyle = '-', linewidth=3, color='red')
boxprops = dict(linewidth = 3, color = 'red')
capprops = dict(linewidth = 3, color = 'red')
flierprops = dict(markersize=10, markeredgewidth=2, markeredgecolor='red', markerfacecolor='red')
medianprops = dict(linewidth = 3, color = 'red')

df.boxplot(boxprops = boxprops,
            whiskerprops = whiskerprops,
            capprops = capprops,
            flierprops = flierprops,
            medianprops = medianprops,
            rot = 45, grid=False)


In [None]:
#Make a pie plot from a Series: y value unambiguous here (vs DataFrame)
###

df.iloc[2].plot.pie();

In [None]:
#Histograms, again:
####

df3 = pd.DataFrame({"A":np.random.randn(1000) - 1, "B":np.random.randn(1000), "C":np.random.randn(1000) + 1})


fig1, ax1 = plt.subplots(1,2, figsize=(14,6))

df3.plot.hist(alpha=0.5, bins=30, ax=ax1[0])
df3.plot.hist(stacked=True, bins=30, ax=ax1[1]) #orientation="horizontal"

In [None]:
#df3.diff()

In [None]:
#Can plot on multiple subplots like so:
fig1, ax1 = plt.subplots(1,1, figsize=(12,6))

df3.hist(color="darkblue", bins=50, ax=ax1);

In [None]:
#Also can do a kernel density estimate:
#####
fig1, ax1 = plt.subplots(1,1, figsize=(12,6))

df3['A'].plot.kde(bw_method=1)


In [None]:
#Plot both histogram and kde together...
fig1, ax1 = plt.subplots(1,1, figsize=(12,6))

df3['A'].hist(color="darkblue", bins=50, ax=ax1, alpha=.5, density=True)

x = df3['A'].plot.kde(ax=ax1, linewidth=5, color='red')

In [None]:
#Want to integrate the kernel?
#############

#Import the stats submodule explicitly: a bare "import scipy" may not load it
import scipy.stats

#Estimate the kernel
kernel = scipy.stats.gaussian_kde(df3['A'])

#Get the kernel at certain points:
y = kernel(np.linspace(-8,8,100))

display(type(y))

plt.plot(y.cumsum() / y.sum())
plt.plot(y)

In [None]:
#All together:

fig1, ax1 = plt.subplots(1,1, figsize=(12,6))

h = df3['A'].hist(color="darkblue", bins=50, ax=ax1, alpha=.5, density=True)

x = df3['A'].plot.kde(ax=ax1, linewidth=5, color='red')

#Estimate the kernel
kernel = scipy.stats.gaussian_kde(df3['A'])

#Get the kernel at certain points:
x_points = np.linspace(-8,8,100)
y = kernel(x_points)

ax1.plot(x_points, y.cumsum() / y.sum())
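The normalized `cumsum()` above approximates the cumulative distribution on the chosen grid. As a cross-check, the running sum can be scaled by the grid spacing and compared with the `integrate_box_1d` method of `gaussian_kde`. A sketch using a seeded random sample in place of `df3['A']`:

```python
import numpy as np
from scipy import stats

# Stand-in for df3['A']: 1000 draws from a normal centered at -1
rng = np.random.default_rng(0)
sample = rng.normal(loc=-1, size=1000)

kernel = stats.gaussian_kde(sample)

# Evaluate the density on an evenly spaced grid
x_points = np.linspace(-8, 8, 400)
dx = x_points[1] - x_points[0]

# Riemann-sum approximation of the cumulative distribution
cdf_approx = np.cumsum(kernel(x_points)) * dx

# Integral of the estimated density over the same interval
total = kernel.integrate_box_1d(-8, 8)

print(cdf_approx[-1], total)
```

Both numbers should be close to 1, since nearly all of the estimated density lies inside the interval.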

Finally, the hexbin:

In [None]:
###Make a hexbin!

df3.plot.hexbin(x = 'A', y='B', gridsize=20, cmap='viridis', figsize=(8,8))


As always, there's a lot more you can do with Pandas, grouping, plotting, and so on...