# MFRE Summer Session: Python Workshop 2


## 1.	Data Manipulation with `pandas`
- 1.1	Introduction to the pandas library for data manipulation
- 1.2	Working with pandas' data structures (Series, DataFrame)




## 2. Hands-on Work: Data Cleaning and Exploratory Analysis

2.0  Data import
- Code taught: `pd.read_csv()`

2.1	How many rows and columns are in the sector GHG emissions dataset?
- Code taught: `.shape`
- Exercise: how many in regional GHG emissions dataset? Larger/smaller than sector?

2.2	What are the column names and data types in the dataset?
- Code taught: `.columns`, `.dtypes`
- Exercise: Get the column names and datatypes of the `regional_emissions` dataset.

2.3	What are the total greenhouse gas emissions for each economic sector in Canada?
- Code taught: `.sum()`
- Exercise: get sum of each column and row of `regional_emissions`. How are these values different?

2.4	Which industry had the single highest year of emissions? The single lowest? What were these values?
- Code taught: `.idxmax()`, `.idxmin()`, `.max()`, `.min()`
- Exercise: get these from `regional_emissions`

2.5 Slicing and selection with `.loc[]`, `.iloc[]`
- Code taught: `.loc[]`, `.iloc[]`

2.6 Filtering with boolean conditions
- Code taught: `df[df["column"] < value]`

2.7 Column Operations
- Code taught: `df["new_column"] = df["column_a"]*df["column_b"]`
- calculating percentage change over time

2.8 Summary Stats: pandas function
- `.describe()`

2.9 Open-Ended Exercise

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#  0. Data Import with `read_csv()`

![memorize all of these](https://i.imgur.com/BJq4hmO.png)

Above is the signature of the Pandas `read_csv()` function, which we call as `pd.read_csv()`.

A partial list of the function's possible arguments is shown in the image; you will need to memorize all of these, as well as some others that aren't printed, if you want to succeed in the MFRE program.

I will close this page and we'll go around the room. Everyone needs to give a full explanation of any one of these arguments.

...

...


...

...

...

Yeah, I'm kidding. I've never read most of these, let alone used them.

In the normal case, where you're running Python in a development environment on your own computer, the process looks like this:

Call `pd.read_csv()` with just one argument: the path to your file. I'll go over the process I typically use.

![image1](https://i.imgur.com/PQ22zSg.png)

First, right-click your `.csv` file, and click *Properties* in the menu, circled in red.

![image2](https://i.imgur.com/hiGKFcc.png)

Next, navigate to the *Security* tab circled in red, and copy the `Object name` section, circled in blue.

Next, go into your Python workbook and paste this into the `pd.read_csv()` call, surrounded by quotation marks.

In [None]:
# this won't work

regional_emissions = pd.read_csv("C:\Users\jizatt\Documents\summer_session_workshops\regional_emissions.csv")

But this throws an error!

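Frustratingly, Python treats the "\" in a normal string as the start of an escape sequence (here, `\U` is read as the beginning of a Unicode escape), so a pasted Windows path breaks before `read_csv()` even sees it. Using "/" slashes instead avoids the problem.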

Don't ask me if these are back-slashes or forward slashes - I never could remember!

We have to go back through and replace all of these.

In [None]:
# replace all \ slashes with / slashes and this should work!

regional_emissions = pd.read_csv("C:/Users/jizatt/Documents/summer_session_workshops/MFRE-summer-session-workshops-2023/regional_emissions.csv")

regional_emissions.head()

# this won't work either

If we were working in Anaconda, or VSCode, this would work fine, since those are based out of your own computer. So, remember this process for if you do work in those environments!
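For those local environments, here is a small sketch of an alternative to replacing the slashes by hand: putting an `r` in front of the string (a "raw" string) tells Python to leave the backslashes alone. The path below is the same example location used above, and `PureWindowsPath` is only used to compare the two spellings as text; nothing is read from disk, so this runs anywhere.

```python
from pathlib import PureWindowsPath

# The same example file location, written two ways.
# The r"..." prefix makes a raw string, so the backslashes are kept as-is.
raw_path = r"C:\Users\jizatt\Documents\summer_session_workshops\regional_emissions.csv"
forward_path = "C:/Users/jizatt/Documents/summer_session_workshops/regional_emissions.csv"

# PureWindowsPath only parses the text; it never touches the file system.
same_location = PureWindowsPath(raw_path) == PureWindowsPath(forward_path)
converted = PureWindowsPath(raw_path).as_posix()

print(same_location)
print(converted)
```

Either spelling can then be passed to `pd.read_csv()` on a machine where the file actually exists.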

We're using Google Colaboratory, which is based in the cloud. This means it doesn't have inherent access to your system files.

The overall easiest way to import data, regardless of your platform, involves using Google Drive. I'll use it to *actually* import the workshop data.

First, you upload your `.csv` data file to Google Drive, and set sharing to "Anyone with the link", then copy the download link. You will use the following code to adjust the URL for download.

In [None]:
# regional_emissions.csv Google Drive method

url = "https://drive.google.com/file/d/1M0Ab8MwvP9d7_Lr-p09xYeJqVsJzB-x4/view?usp=sharing"
path = "https://drive.google.com/uc?export=download&id="+url.split('/')[-2]
regional_emissions = pd.read_csv(path)

regional_emissions.head()

In [None]:
# sector_emissions.csv Google Drive method

url = "https://drive.google.com/file/d/1Mq5fuv5tqBexcjbuuZhoT2m-IpCb8SSX/view?usp=sharing"
path = "https://drive.google.com/uc?export=download&id="+url.split('/')[-2]
sector_emissions = pd.read_csv(path)

sector_emissions.head()
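If you're wondering what the URL adjustment in the two cells above is doing, here is a step-by-step sketch using the first share link. It only manipulates the text of the link (no download happens), and the variable names `parts` and `file_id` are just for illustration.

```python
# The share link copied from Google Drive
url = "https://drive.google.com/file/d/1M0Ab8MwvP9d7_Lr-p09xYeJqVsJzB-x4/view?usp=sharing"

# Splitting on "/" breaks the link into pieces
parts = url.split("/")
print(parts)

# The second-to-last piece is the file's ID
file_id = parts[-2]
print(file_id)

# Gluing the ID onto the download prefix gives the link read_csv() can use
path = "https://drive.google.com/uc?export=download&id=" + file_id
print(path)
```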

In [None]:
# in case google drive method doesn't work, we can import manually with this as base

# sector_emissions = pd.read_csv("C:/Users/jizatt/Documents/summer_session_workshops/sector_emissions.csv")

# 1.	How many rows and columns are in the sector emissions dataset?

When you're working with a Pandas DataFrame object, it's important to know the shape; this is a `(n, q)` tuple where `n` is the length and `q` is the width.

Typically, a dataset's rows are samples while its columns are variables, so `n` tells you the sample size, and `q` tells you the number of variables counted.

In [None]:
sector_emissions.shape

Length comes first, then Width. The former is how many rows, i.e. samples the dataframe has; the latter is how many columns, i.e. variables.

What if we look at just one column?

In [None]:
sector_emissions["Agriculture"].shape

In [None]:
type(sector_emissions["Agriculture"])

When we call just one column of a DataFrame, it turns into a Series, whose shape gets reported as `(32,)`.

In [None]:
sector_emissions[["Agriculture", "Heavy industry"]].shape

However, when we select columns with a list of column names, it stays a DataFrame: the two-column selection above gets reported as `(32, 2)`, and even a list with just one name gets reported as `(32, 1)`.

In [None]:
sector_emissions[["Agriculture"]].shape

In [None]:
type(sector_emissions[["Agriculture"]])
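To see the single-bracket versus double-bracket difference without the workshop data loaded, here is a minimal sketch on a tiny DataFrame. The numbers are made up purely for illustration and are not real emissions values.

```python
import pandas as pd

# Made-up numbers, only for illustrating shapes
toy = pd.DataFrame({
    "Year": [1990, 1991, 1992],
    "Agriculture": [10.0, 11.0, 12.0],
    "Heavy industry": [20.0, 19.0, 18.0],
})

series_shape = toy["Agriculture"].shape    # single brackets -> Series
frame_shape = toy[["Agriculture"]].shape   # list with one name -> DataFrame

print(toy.shape)                  # rows first, then columns
print(series_shape)               # one number only: the length
print(frame_shape)                # still two numbers: rows and columns
print(type(toy["Agriculture"]))
print(type(toy[["Agriculture"]]))
```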

# Exercise 1:

Get the shape of `regional_emissions`. How many rows and columns are in it?

Is it a bigger or smaller dataset than `sector_emissions`?

We can see it's smaller in both length (13 vs 32 rows) and width (5 vs 8 columns), so it must be smaller overall.

# 2.	What are the column names and data types in the dataset?


While we can just call a DataFrame and read off the columns, there will be times you want to specifically get the list of column names in a dataset.

In [None]:
sector_emissions.columns

You can use `df_name.columns` and have Python return them that way. However, the result is a Pandas `Index` object, not a plain Python list, so it isn't always convenient to pass along elsewhere.

What about `list(df_name.columns)` ?

In [None]:
list(sector_emissions.columns)

That's more like it. We can slice this just like any other list:

In [None]:
list(sector_emissions.columns)[:4] # first four columns

To get the data type of each column, use `df_name.dtypes`

In [None]:
sector_emissions.dtypes

### Why bother? Reasoning behind Parts 1/2

When you're importing a dataset, you probably have some idea of what size it is - roughly how many variables it has, and what ballpark the sample size is in. Calling `.shape` lets you quickly check if something is glaringly wrong. `.columns` and `.dtypes` let you make sure that everything you need is there, and that all the data are in the format you expect.

This way, if something is wrong, you can catch it early, before you spend a bunch of time stumbling around through various error messages, or worse, doing work without knowing something is wrong!
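As a rough sketch of what such an early check could look like, the cell below compares a made-up DataFrame against a hypothetical list of columns we expected to find. The data, the `expected_columns` list, and the variable names are all assumptions for illustration; `pd.api.types.is_numeric_dtype` is a Pandas helper that reports whether a column holds numbers.

```python
import pandas as pd

# Made-up stand-in for a freshly imported dataset
toy = pd.DataFrame({
    "Year": [1990, 1991, 1992],
    "Agriculture": [10.5, 11.0, 11.2],
    "Note": ["a", "b", "c"],
})

# Hypothetical list of the columns we were expecting
expected_columns = ["Year", "Agriculture", "Transport"]

# Which expected columns are missing?
missing = [name for name in expected_columns if name not in toy.columns]

# Is a column we plan to do arithmetic on actually numeric?
agriculture_is_numeric = pd.api.types.is_numeric_dtype(toy["Agriculture"])
note_is_numeric = pd.api.types.is_numeric_dtype(toy["Note"])

print(toy.shape)
print(missing)
print(agriculture_is_numeric, note_is_numeric)
```

Here the check would flag `"Transport"` as missing right away, before any analysis code runs.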

# Exercise 2:

Get the column names and datatypes of the `regional_emissions` dataset.

# 3. What is the total greenhouse gas emission for each economic sector in Canada?

If you remember from Workshop 1, we actually did this there too. It involved taking every data column of `sector_emissions` and looping through it, adding up each value until we had our total.

There are two issues with this approach.

1. It's code-intensive. We had to write each loop, or at least copy and modify the loop, for each column. This creates work for us, and fills a lot of space in our workbook, making the overall file less easily readable.

2. It's processing-inefficient. This dataset is small, so that's fine. But if you work with datasets of thousands, tens of thousands, hundreds of thousands, or more samples, then looping will rapidly become infeasible.

I once asked a professor for help with code for my undergraduate thesis project. When he saw I was using loops to clean my dataset, he couldn't stop laughing. He was right, and still a big help! But let me spare you some pain.

(And I promise not to laugh at you!)

In [None]:
sector_emissions.head()

In [None]:
# Python Workshop 1-style approach

oil_gas = sector_emissions["Oil and gas"]
oil_gas_total = 0

for num in oil_gas:
    oil_gas_total += num

oil_gas_total

In [None]:
list(sector_emissions.columns)[1:]

We can actually run this as a loop within a loop. For each column, we'll total up the values within and print them.

In [None]:
%%time

for column in list(sector_emissions.columns)[1:]:
    total = 0
    for num in sector_emissions[column]:
        total += num
    print(column + " total equals: " + str(total))

In [None]:
%%time

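# select every column except "Year" (the first one), then sum each column
sector_emissions[list(sector_emissions.columns)[1:]].sum()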

The CPU time is close to zero, and the Wall time is also insignificant, but even this reveals how the vectorized method with `.sum()` takes around half the time of the loop.

The distinction becomes much more clear when the sample size gets bigger:

![stack_overflow](https://i.imgur.com/Zk7op8v.png)

[Source](https://stackoverflow.com/questions/54028199/are-for-loops-in-pandas-really-bad-when-should-i-care)

List comprehensions are optimized versions of loops, written in a different format, but the point should hold: beyond a sample size of a few thousand, they are slower, and their run time scales linearly with `n`. In comparison, vectorized operations, such as Pandas functions (built on top of NumPy), rise in computation time only slowly.

The scale is worth noting here; at any sample size you are likely to work on in MFRE, all of these will be negligible. But with large data the difference does grow, and other concerns such as readability still push us towards using functions over loops.
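To check that the two approaches really do give the same totals, here is a minimal side-by-side sketch. The DataFrame is a made-up miniature (three years, two sectors, invented values), not the workshop data.

```python
import pandas as pd

# Made-up miniature of a sector dataset
toy = pd.DataFrame({
    "Year": [1990, 1991, 1992],
    "Oil and gas": [100.0, 110.0, 120.0],
    "Transport": [50.0, 55.0, 60.0],
})

data_columns = list(toy.columns)[1:]   # everything except "Year"

# Loop approach: add up each column one value at a time
loop_totals = {}
for column in data_columns:
    total = 0
    for num in toy[column]:
        total += num
    loop_totals[column] = total

# Vectorized approach: one call does every column
vectorized_totals = toy[data_columns].sum()

print(loop_totals)
print(vectorized_totals)
```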

# Exercise 3:

Take the `regional_emissions` dataset and find the sum of each column. What have you calculated with this?

Next, input the argument `axis = 1` and take the `.sum()`. What did you calculate this time?

Even when speed doesn't matter, you should still stick with dedicated functions over loops when possible. They're much more readable, and easier to write as well. Compare the 5 lines of code for the loop above with the 1 line for the `.sum()` call.

# 4. Which industry had the single highest year of emissions? The single lowest?

The index of a Pandas DataFrame is the set of labels running down the left-hand side; for `sector_emissions`, this means the 0, 1, 2, 3, etc, up to 31 in the bottom row. The index is vital for accessing, aligning, and joining datasets.

For example, if you have two different yearly datasets, one with values of interest rates, and one with unemployment rates, you could use the yearly index to join both together so that all the years match.

We'll use it here to find certain values, with the `.idxmax()`, `.idxmin()`, `.max()`, and `.min()` functions.

But first, we'll need to explicitly set our index as the `"Year"` column, with `.set_index()`

In [None]:
sector_emissions

In [None]:
sector_emissions_index = sector_emissions.set_index("Year")
sector_emissions_index

As you saw, we assigned this as a new variable, `sector_emissions_index`, naming the `"Year"` column as the index.

You can also just set the index on the current DataFrame. By default, `.set_index()` leaves the original untouched and returns a new DataFrame, which is why we had to assign the result to a new variable above.

Instead, we can use the argument `inplace = True`. Think of this as saying "set the index to year, and perform it in place."

In [None]:
sector_emissions.set_index("Year", inplace = True)

In [None]:
sector_emissions.head()

We want one of these in "vanilla" form, so I'll use `.reset_index()` to turn it back to normal.

In [None]:
sector_emissions.reset_index(inplace=True)

In [None]:
sector_emissions.head()

And as you can see, `"Year"` is back to being a regular column, and the index has returned to the default 0, 1, 2, ... labels.
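The round trip above can be sketched on a tiny made-up DataFrame (invented values, two rows), which makes it easy to confirm that `.set_index()` without `inplace` leaves the original alone:

```python
import pandas as pd

# Made-up values for illustration
toy = pd.DataFrame({"Year": [1990, 1991], "Electricity": [95.0, 96.0]})

# Without inplace, set_index() returns a new DataFrame...
indexed = toy.set_index("Year")

# ...and the original still has "Year" as an ordinary column
toy_columns_after = list(toy.columns)
indexed_columns = list(indexed.columns)
indexed_index_name = indexed.index.name

# reset_index() turns the index back into a regular column
restored = indexed.reset_index()

print(toy_columns_after)
print(indexed_columns, indexed_index_name)
print(list(restored.columns))
```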

#### Why does the index matter?

1. It labels and identifies the rows.
2. We can select and slice data using it.
3. We can join datasets using it.

Numbers 1 and 2 are our main interests here; joining is a topic for another day. But even just for what we're doing now, the index plays a vital role.


# Exercise 4a:

Take the `regional_emissions` dataset. Using the `.set_index()` command, create a variable called `regional_emissions_index` with the `"Region"` variable set as its index.

Then, set the index of `regional_emissions` as `"Region"` without creating a new variable.

Last, reset the index of `regional_emissions`.


In [None]:
regional_emissions

### [Exercise End]

# `.max()`, `.min()`, `.idxmax()`, `.idxmin()`

Now that we've seen how the index works, we'll use it with the above functions to retrieve the greatest and lowest values of the columns.

In [None]:
sector_emissions_index.max()

In [None]:
sector_emissions_index.min()

In [None]:
sector_emissions_index

As we can see, using the `.max()` and `.min()` methods on the indexed datasets directs Python to return the greatest or lowest value in each column.

We may also be interested in what *year* we have these maximums and minimums. When our index is the `"Year"` variable, the `.idxmax()` and `.idxmin()` methods return the matching years:

In [None]:
sector_emissions_index.idxmax()

In [None]:
sector_emissions_index.idxmin()

There's some interesting information here. Most of the sectors had their maximum emissions levels recently, and their lowest near the start of the dataset. But `"Electricity"` maxed out in 2001 and had its minimum value in the most recent year, 2021. `"Heavy industry"` and `"Waste and others"` also have unusual patterns.
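To make the difference between the value functions and the index functions concrete, here is a minimal sketch on an invented three-year table. The numbers are made up and are not the real sector values.

```python
import pandas as pd

# Invented values, indexed by year
toy = pd.DataFrame({
    "Year": [2019, 2020, 2021],
    "Electricity": [70.0, 62.0, 52.0],
    "Transport": [185.0, 160.0, 190.0],
}).set_index("Year")

peak_values = toy.max()     # the largest value in each column
peak_years = toy.idxmax()   # the index label (year) where that value sits
low_years = toy.idxmin()    # the year of each column's smallest value

print(peak_values)
print(peak_years)
print(low_years)
```

If the index were still the default 0, 1, 2, ..., `.idxmax()` would return row numbers instead of years, which is why we set `"Year"` as the index first.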

# Exercise 4b:

Take the `.max()`, `.idxmax()`, `.min()`, and `.idxmin()` values of the `regional_emissions_index` dataset. Do you notice any interesting patterns?

# 5. Slicing and Selection with `.loc[]` and `.iloc[]`

When we were working with lists, we would often "slice" out specific sections by putting a number, or a range of numbers, in square brackets after the list object.

For example:

In [None]:
simple_list = [1, 2, 3, 4, 5]

simple_list[2:4]

Filtering a DataFrame involves reducing the full DataFrame to a subset, where the rows match one or more conditions. The `.loc[]` and `.iloc[]` indexers can both do this, but they approach it in different ways:

- `.loc[]` is based on column/row labels, or names. You name your columns and what condition must apply to them. The dataset returned will be limited to rows where your conditions are met.

- `.iloc[]` does the same, but specifying rows/columns by their integer places.


I'd advise you to prioritize `.loc[]`, because it isn't sensitive to columns having their ordering changed.

### Data Selection: One Value

Note: `df` is a generic name for any Pandas DataFrame object.

- `df.loc[row_label, column_label]`

- `df.iloc[row_position, column_position]`

In [None]:
regional_emissions.head()

In [None]:
# this retrieves the value in row position 2, column position 3: Nova Scotia's 2005 emission value
regional_emissions.iloc[2,3]

In [None]:
# this retrieves the value with row label 2, column label "2005": Nova Scotia's 2005 emission value

regional_emissions.loc[2,"2005"]

### Data Selection: Slicing

We can slice in DataFrames like we slice in lists; you specify a column or row to look into, and a range within it to return.

We'll be working with `regional_emissions_index` for this section, so that we can pick out regions by name with `.loc[]` statements.

In [None]:
regional_emissions_index

To get a row:

- `.iloc[]`: first input the row's index number, and then the index numbers of the columns you want

- `.loc[]`: first input the row's index label, and then the index labels of the columns you want

You can select just one row or column, in which case you just input the number/name. Or, you can input a slice statement for the ones you want (around a `:`), or a list, or just a `:` to return everything.
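One detail worth knowing before the examples: slices behave slightly differently in the two indexers. With `.iloc[]`, the end position is left out, just like list slicing; with `.loc[]`, the end label is included. A minimal sketch, with made-up row labels and values:

```python
import pandas as pd

# Made-up labels and values
toy = pd.DataFrame(
    {"1990": [1.0, 2.0, 3.0, 4.0], "2005": [5.0, 6.0, 7.0, 8.0]},
    index=["A", "B", "C", "D"],
)

by_position = toy.iloc[1:3, 0]       # positions 1 and 2 (position 3 is excluded)
by_label = toy.loc["B":"C", "1990"]  # labels "B" through "C" (end label included)

print(by_position)
print(by_label)
```

Both selections return the same two rows here, even though the `.iloc[]` slice ends at 3 and the `.loc[]` slice ends at `"C"`.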




### 1. `.iloc[]` examples

In [None]:
# one specific value

regional_emissions_index.iloc[1, 1] # one value


### 1.1 One row, selecting columns

In [None]:
regional_emissions_index.iloc[1, 1:4] # a slice from a row

In [None]:
regional_emissions_index.iloc[1, [1,2,3]] # a list from a row (equivalent to the slice!)

In [None]:
regional_emissions_index.iloc[1, :] # a whole row

### 1.2 One column, selecting rows

In [None]:
regional_emissions_index.iloc[:, 1] # a column

In [None]:
regional_emissions_index.iloc[1:4, 1] # a slice of rows in one column

In [None]:
regional_emissions_index.iloc[[1,2,3], 1] # a list of rows in one column (equivalent to the slice!)

### 1.3 Multiple rows, multiple columns

In [None]:
regional_emissions_index.iloc[1:4, 2:4] # a set of rows and columns together (with slice)

### 2. `.loc[]` examples

### 2.1 One value

In [None]:
regional_emissions_index.loc["Nova Scotia", "1990"] # one value



### 2.2 One row, selecting columns

In [None]:
regional_emissions_index.loc["Nova Scotia", "1990":"2021"] # a slice from a row


In [None]:
regional_emissions_index.loc["Nova Scotia", ["1990", "2005", "2021"]] # a list from a row (equivalent to the slice!)

In [None]:
regional_emissions_index.loc["Nova Scotia", :] # a whole row

### 2.3 One column, selecting rows

In [None]:
regional_emissions_index.loc[:, "1990"] # a column

In [None]:
regional_emissions_index.loc["Prince Edward Island":"New Brunswick", "1990"] # a slice of rows in one column

In [None]:
regional_emissions_index.loc[["Prince Edward Island", "Nova Scotia", "New Brunswick"], "1990"] # a list of rows in one column (equivalent to the slice!)

### 2.4 Multiple columns and rows: with lists

When we did this in `.iloc[]`, we used slices. Here we use lists, to show that works too.

In [None]:
regional_emissions_index.loc[["Prince Edward Island", "Nova Scotia", "New Brunswick"], ["1990", "2005", "2021"]] # a set of rows and columns together (with list)

## `.loc[]` versus `.iloc[]`: Which to use and why?

For general selection of columns, I'd recommend you stick with `.loc[]`. Column names are much more likely to remain stable than data positioning.


#### Say that column names are changed, affecting `.loc[]`, or positions change, affecting `.iloc[]`. Which is worse?

In my opinion, the latter is **much worse**.

- If column names change, `.loc[]` will throw an error message and break.

- But if positions change, `.iloc[]` will grab whatever's there and keep going without batting an eyelid. This could feed in the wrong data at a later point without you having any idea what's going on!
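Here is a small sketch of both failure modes, using an invented two-row table. The column reordering and the rename to `"y1990"` are hypothetical changes made up for the demonstration.

```python
import pandas as pd

# Invented values
toy = pd.DataFrame({"Region": ["X", "Y"], "1990": [1.0, 2.0], "2005": [3.0, 4.0]})

# Scenario 1: someone reorders the columns
reordered = toy[["Region", "2005", "1990"]]
before = toy.iloc[0, 1]                  # position 1 is "1990" here
after = reordered.iloc[0, 1]             # position 1 is now "2005": wrong data, no warning
label_after = reordered.loc[0, "1990"]   # .loc[] still finds the right column

# Scenario 2: someone renames a column
renamed = toy.rename(columns={"1990": "y1990"})
try:
    renamed.loc[0, "1990"]
    raised = False
except KeyError:
    raised = True                        # .loc[] fails loudly instead

print(before, after, label_after, raised)
```

The loud `KeyError` is annoying, but it tells you something changed; the silent swap from `.iloc[]` does not.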

### In Defense of `.iloc[]`: sorting

One thing `.iloc[]` handles that `.loc[]` can't easily do relates to sorting. If you use another function to sort a DataFrame by the values in one column, you can then use `.iloc[]` to grab the top X or bottom X rows, giving you the rows with the highest or lowest values of that variable.

For example, below we sort by 2021 emissions twice: first from least to greatest, taking the 5 rows at the top to get the 5 regions with the lowest 2021 emissions, and then from greatest to least, taking the top 5 to get the 5 regions with the highest.

In [None]:
# get 5 rows with lowest 2021 values

regional_emissions.sort_values("2021", ascending = True).iloc[:5]

In [None]:
# get 5 rows with highest 2021 values

regional_emissions.sort_values("2021", ascending = False).iloc[:5]

# Data Selection: Columns

Simply grabbing one column, or a set of them, is easier than grabbing specific bits of data.

For one column, use the format `df["column"]`. Think of this as telling Python to give you just `"column"` from the dataframe called `df`

For multiple columns, you pass a list of columns, like `df[["column_1", "column_2"]]`

In [None]:
regional_emissions["Region"]

In [None]:
regional_emissions[["1990", "2005", "2021"]]

# Exercise 5:

Create `sector_emissions_2000s`, including only rows of `sector_emissions` with a `Year` of 2000 or greater.

Then, access `sector_emissions_2000s` columns one at a time, and `.sum()` each up, recording these as an identifiable variable.

Lastly, add up all of these values to get the total emissions in Canada from 2000 onwards.

# 6. Filtering


If you're like me, you're probably not paying much attention anymore. The dude up at the front is just going on and on about "locks" and "iLocks".

(Some weird new Apple release?)

But this part is important. A **very** common operation in data cleaning is *"Filtering"*, where you take a large dataset and reduce it to just the samples that fit certain characteristics.


Examples:

- If you've got data on crop returns in Canadian provinces, you might filter to just the samples from Alberta and British Columbia.
- If you've got macroeconomic data, you might filter to just samples from 2000 onwards.
- If you've got a medicinal study, you might filter to just candidates with a certain health condition.

## Can anyone come up with an idea like this - a theoretical dataset, and a criterion you want to filter it by?


### Core idea

There are a few ways to filter data. What we'll be doing revolves around creating a "boolean array", and then selecting rows with it.

A boolean array is created by checking a logical "True or False" condition against some value in each row. For example, in a dataset of daily weather reports, the condition might be whether the high temperature exceeded 20.0C; each day then gets a `True` or a `False`.

Then, the rest of the expression subsets the dataset to just the rows where this condition is true. We'll work through some examples below.

#### 6.1 One condition

The generalized syntax for this goes as such:

`df[df["column"] <boolean operator> <value>]`

- `df` is the DataFrame you're filtering
- `"column"` is the column whose values you're examining
- `<boolean operator>` is an operator, like `<`, `>`, `==`, `<=`, or `>=`
- `<value>` is just some value

What happens then, is that Python uses the `<boolean operator>` term to compare `<value>` with the row's `"column"` value, returning `True` or `False`. Because it does this for each row, it creates our "boolean array"

Finally, `df` is subsetted to just the rows which line up with a `True` value, while the `False` values disappear.

We'll use some examples to get a feel for this.

In [None]:
regional_emissions[regional_emissions["Shortnam"] == "NL"]

What's going on here?

As you can see, we're comparing the `"Shortnam"` column with a string value.

In [None]:
regional_emissions["Shortnam"]

Because only one of these is equal to `"NL"`, we filter down to just that row: Newfoundland and Labrador.

In [None]:
regional_emissions[regional_emissions["1990"] <= 10.0]

Our next example uses a numeric comparison: it makes a boolean array based on whether each row's `"1990"` value is less than or equal to 10.0, and returns the rows where it is.
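To make the two steps explicit (build the boolean array, then select with it), here is a sketch that stores the array in its own variable first. The three rows and their values are made up for illustration, and `mask` is just a conventional name.

```python
import pandas as pd

# Made-up rows
toy = pd.DataFrame({
    "Shortnam": ["AA", "BB", "CC"],
    "1990": [5.0, 12.0, 9.5],
})

# Step 1: the comparison gives one True/False per row
mask = toy["1990"] <= 10.0
print(mask)

# Step 2: selecting with the mask keeps only the True rows
filtered = toy[mask]
print(filtered)
```

Writing `toy[toy["1990"] <= 10.0]` does exactly the same thing in one line.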

#### Multiple Conditions

Inside of your `df[]` statement, you can include multiple conditions. These can be joined with either the `&` operator (meaning both must be true for a True entry in the boolean array) or an `|` operator (meaning either being true is sufficient).


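You need to wrap each boolean statement in round brackets `()`: Python evaluates `&` and `|` before comparisons like `<` and `>`, so leaving the brackets out changes what gets compared and usually causes an error. The brackets also keep the conditions clearly separated and readable; a line break in between is also good practice.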


In [None]:
regional_emissions[(regional_emissions["1990"] < 10) and
                   (regional_emissions["2005"] > 10.0)]

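As you can see, the normal `and` operator throws an error: Python can't reduce a whole column of `True`/`False` values to a single `True` or `False`. Use `&` instead.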

In [None]:
regional_emissions[(regional_emissions["1990"] < 10) &
                   (regional_emissions["2005"] > 10.0)]
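Here is the same pattern as a self-contained sketch on invented data, covering `and` (which fails), `&` (both conditions), and `|` (either condition). The rows and values are made up for illustration.

```python
import pandas as pd

# Invented values
toy = pd.DataFrame({
    "Shortnam": ["AA", "BB", "CC"],
    "1990": [5.0, 8.0, 20.0],
    "2005": [12.0, 6.0, 25.0],
})

# The plain `and` keyword raises an error on whole columns
try:
    toy[(toy["1990"] < 10) and (toy["2005"] > 10.0)]
    and_error = None
except ValueError as err:
    and_error = type(err).__name__

# & keeps rows where BOTH conditions are True
both = toy[(toy["1990"] < 10) &
           (toy["2005"] > 10.0)]

# | keeps rows where AT LEAST ONE condition is True
either = toy[(toy["1990"] < 10) |
             (toy["2005"] > 10.0)]

print(and_error)
print(both)
print(either)
```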

You could also just use two filters in order to get the same result.

In [None]:
regional_emissions_1990_under_10 = regional_emissions[regional_emissions["1990"] < 10]
regional_emissions_1990_under_10_and_2005_above_10 = regional_emissions_1990_under_10[regional_emissions_1990_under_10["2005"] > 10.0]

regional_emissions_1990_under_10_and_2005_above_10

# Legibility - Writability Tradeoff

As you can see, this gets us to the same place. It also illustrates one issue in data cleaning; naming the versions of your dataset.

`regional_emissions_1990_under_10_and_2005_above_10` is descriptive, but it's not convenient to write.

Or say, for that matter.

But a shorter name might lose out on important information for keeping track of what's what. `regional_emissions_filtered` is a lot quicker to write; but what filtering did you do?

Unless you're only doing one filter operation (and you never do just one!) you'll very quickly get confused. I don't have a perfect guide for this, just suggestions:

1. Start your dataframe names small. `reg_df` standing for "regional emissions dataframe" might have been the better choice.
2. If you modify a dataset and don't need to refer to pre-change versions later, don't split off a new version; just make the modifications in place.
3. Figure out what changes you need made to your dataset, and do these immediately in one section so you only have one (or a few distinct) datasets to use for analysis.
4. Not code: keep a legal pad or notebook with you, and write down your different dataset names as you go.

# Exercise 6:

Use `.loc[]` or `.iloc[]` to get all the following subsets of data from `sector_emissions`

1. `"Agriculture"` and `"Heavy industry"` in all years.
2. All `Year` values after 2000.
3. `Oil and gas` and `Transport` values prior to 2010.
4. All columns in 2019, 2020, and 2021.

In [None]:
sector_emissions.tail()

In [None]:
# use .loc[] or .iloc[]



# 7. Column Operations

Another very important part of Pandas programming is column operations: you can select columns in the `df["column"]` format and perform arithmetic calculations with them, typically to create new columns.

In [None]:
regional_emissions["1990 to 2005 Difference"] = regional_emissions["2005"] - regional_emissions["1990"]
regional_emissions

Here, we subtracted the 1990 values of each province's emissions from the 2005 values, getting the increase or decrease in this time.

In [None]:
regional_emissions["multiplication_example"] = regional_emissions["2005"] * regional_emissions["1990"]

regional_emissions["division_example"] = regional_emissions["2005"] / regional_emissions["1990"]

regional_emissions

As you can see, we can also do multiplication (`*`) and division (`/`). There's no meaningful interpretation to multiplication, though division can show us the percentage change in this time (if we subtract 1 and multiply by 100).



## Exercise 7:

Calculate the percentage change from 1990 to 2005 emissions by taking `"division_example"`, subtracting 1, and then multiplying by 100.

You can treat the whole `regional_emissions["division_example"]` column as one value for this arithmetic. Even though `1` and `100` are single values (scalars) rather than columns of the same length, Pandas simply applies them to every row in the column.

You can either create a new column for this intermediate step, or wrap the subtraction operation in round brackets `()` to ensure your order of operations is correct.
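So as not to give away the answer, here is a sketch of the same two ideas (scalars applied to every row, and brackets controlling the order of operations) using an unrelated, made-up calculation. The column names `a_scaled` and `no_brackets` are invented for the example.

```python
import pandas as pd

# Made-up column
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0]})

# With brackets: add 1 to every row first, then multiply every row by 10
toy["a_scaled"] = (toy["a"] + 1) * 10

# Without brackets: 1 * 10 happens first, so this just adds 10 to every row
toy["no_brackets"] = toy["a"] + 1 * 10

print(toy)
```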

In [None]:
# Answer


# 8. Summary Statistics

For some quick summary statistics on a DataFrame's numeric features, you can quickly call `.describe()` after your DataFrame object. It will give you:

- count (how many observations)
- mean (average)
- std (standard deviation)
- min (minimum value)
- 25%/50%/75% (quartiles)
- max (maximum value)

These are computed for each column and returned in their own DataFrame as below:

In [None]:
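# keep the original columns plus the two derived ones
# (assumes your Exercise 7 answer is stored in a column named "1990 to 2005 percent change")
regional_emissions = regional_emissions[["Region", "Shortnam", "1990", "2005", "2021", "1990 to 2005 Difference", "1990 to 2005 percent change"]]

regional_emissions.describe()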
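If you want to see the layout of the `.describe()` result without the workshop data, here is a minimal sketch on an invented four-row table. Note how the text column is left out by default, since these statistics only make sense for numbers.

```python
import pandas as pd

# Invented values
toy = pd.DataFrame({
    "Region": ["X", "Y", "Z", "W"],
    "1990": [1.0, 2.0, 3.0, 4.0],
})

summary = toy.describe()
print(summary)

# The result is itself a DataFrame, so .loc[] works on it
mean_1990 = summary.loc["mean", "1990"]
print(mean_1990)
```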



# 9.	Open-ended, time-permitting exercise

Using the techniques we've learned, analyze the `sector_emissions` dataframe. Identify the industries that have seen a decrease in their emissions levels since 2000. Define this however you want!

However, you must select your condition, and **explicitly compare the values in the `sector_emissions` dataframe**. There must be a `True` or `False` value returned when classifying!

I'll be walking around to give help, but you have to come up with the method yourself.

In [None]:
sector_emissions