
# Instructions

**Purpose**
The purpose of this assignment is to learn how to use `Pandas` to explore a data set.

**Task**
In this homework, you will analyze data from *Our World in Data* and the *World Bank* on CO$_2$ and other greenhouse gas emissions. The notebook contains a series of questions and tasks that you will answer using `Pandas`. You will watch the tutorial videos on Canvas and follow along to complete the problems here.

You will submit your work as a single notebook. I will use `Restart and run all` to run your code. Therefore, you should not include code that does not work in the middle of the notebook. If you are unable to answer a question and want to show your attempted solution, include the non-working code at the end of the notebook.

# Import libraries

In [None]:
# Import pandas, numpy, and matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# seaborn is a data visualization library built on matplotlib
import seaborn as sns 

# set the plotting style 
sns.set_style("whitegrid")

# Video 1: Importing and exploring a data set

## Import the data set

##### $\rightarrow$ Use Pandas to load the file `owid-co2-data.csv` from https://github.com/owid/co2-data as a `DataFrame`. Name the `DataFrame` `df`.
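As a minimal sketch of how `pd.read_csv` behaves, the example below reads a tiny in-memory CSV rather than the real file. The country names and values are made up for illustration, and `toy` is a hypothetical name; for the assignment you would pass the path or raw URL of `owid-co2-data.csv` instead of the `StringIO` object.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file: made-up rows, not real emissions data
csv_text = """country,year,co2
Freedonia,2000,1.5
Freedonia,2001,1.7
Sylvania,2000,3.2
"""

# pd.read_csv accepts a local path, a URL, or a file-like object
toy = pd.read_csv(io.StringIO(csv_text))
print(toy)
```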

## View the contents of the data set

##### $\rightarrow$ Use the `.head()` method to view the data set.

## Basic information about the `DataFrame`

##### $\rightarrow$ Use the `.info()` method to show basic information about the data set.

### Column names

##### $\rightarrow$ Use the `columns` attribute to display the column names.

### DataFrame size

##### $\rightarrow$ Determine the size of the `DataFrame` using the `shape` attribute.
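The four exploration tools above can be tried on a small toy `DataFrame` first. This is only a sketch: `toy` and its made-up values are assumptions for illustration, not part of the assignment data.

```python
import pandas as pd

# Hypothetical toy data with one missing value
toy = pd.DataFrame({
    "country": ["Freedonia", "Freedonia", "Sylvania", "Sylvania"],
    "year": [2000, 2001, 2000, 2001],
    "co2": [1.5, 1.7, 3.2, None],
})

print(toy.head(2))        # first two rows
toy.info()                # dtypes and non-null counts (prints directly)
print(list(toy.columns))  # column labels
print(toy.shape)          # (number of rows, number of columns)
```

Note that `.head()` and `.info()` are methods and need parentheses, while `columns` and `shape` are attributes and do not.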

# Video 2: Data Indexing and Selection

## Select columns of the `DataFrame`

### Select one column

We will use several methods to select the column containing CO$_2$ emissions.

#### Use dictionary-style indexing

##### $\rightarrow$ Use dictionary-style indexing to select the column containing CO$_2$ emissions and name it `CO2`.

#### Use explicit array-style indexing with `.loc`

##### $\rightarrow$ Use explicit indexing with the `.loc` indexer to select the column containing CO$_2$ emissions and display the `head`.

#### Use implicit indexing with `.iloc`

##### $\rightarrow$ Use implicit indexing with the `.iloc` indexer to select the column containing CO$_2$ emissions and display the `head`.
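The three single-column selection styles can be compared side by side on toy data. This is a sketch under assumptions: `toy` and its values are hypothetical, and `columns.get_loc` is used here only to look up the integer position instead of hard-coding it.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "country": ["Freedonia", "Sylvania"],
    "year": [2000, 2000],
    "co2": [1.5, 3.2],
})

by_key = toy["co2"]                      # dictionary-style: column label as a key
by_label = toy.loc[:, "co2"]             # explicit: all rows, column by label
position = toy.columns.get_loc("co2")    # integer position of the column (2 here)
by_position = toy.iloc[:, position]      # implicit: all rows, column by position

# All three return the same Series
print(by_key.equals(by_label) and by_label.equals(by_position))
```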

### Select multiple columns

##### $\rightarrow$ Use a list to select the columns containing the country name, the year, and the CO$_2$ emissions. Name it `df_co2`.

#### Select a range of columns

##### $\rightarrow$ Use a slice and the `.loc` indexer to select the columns between the country name and the CO$_2$ emissions.
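A short sketch of both multi-column selections on toy data follows. The column names and values are assumptions chosen for illustration. The point to notice is that a label slice with `.loc` includes both endpoints, unlike an ordinary Python slice.

```python
import pandas as pd

# Hypothetical toy data with a few extra columns
toy = pd.DataFrame({
    "country": ["Freedonia", "Sylvania"],
    "year": [2000, 2000],
    "population": [100, 200],
    "co2": [1.5, 3.2],
    "gdp": [10.0, 20.0],
})

# A list of labels selects exactly those columns, in the order given
subset = toy[["country", "year", "co2"]]

# A label slice with .loc is inclusive on both ends
span = toy.loc[:, "country":"co2"]

print(list(subset.columns))
print(list(span.columns))
```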

## Select rows of the `DataFrame`

### Select the first $n$ rows

##### $\rightarrow$ Use the `.head()` method to select the first 3 rows of `df_co2`

##### $\rightarrow$ Use implicit indexing to select the first 3 rows

##### $\rightarrow$ Use explicit indexing to select the first 3 rows

### Select the last $n$ rows

##### $\rightarrow$ Use the `.tail()` method to select the last 3 rows of `df_co2`

##### $\rightarrow$ Use implicit indexing to select the last 3 rows

##### $\rightarrow$ Use explicit indexing to select the last 3 rows
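The first-rows and last-rows selections can be checked against each other on toy data. This sketch assumes the default integer index (0, 1, 2, ...); `toy` and its values are hypothetical. With that index, `.iloc` slices exclude the stop position while `.loc` slices include the stop label.

```python
import pandas as pd

# Hypothetical toy data with the default RangeIndex 0..4
toy = pd.DataFrame({"year": [2000, 2001, 2002, 2003, 2004],
                    "co2": [1.0, 1.1, 1.2, 1.3, 1.4]})
n = 3

first_head = toy.head(n)
first_iloc = toy.iloc[:n]          # positions 0, 1, 2 (stop excluded)
first_loc = toy.loc[:n - 1]        # labels 0 through 2 (stop included)

last_tail = toy.tail(n)
last_iloc = toy.iloc[-n:]          # last three positions
last_loc = toy.loc[len(toy) - n:]  # labels 2 through 4

print(first_head.equals(first_iloc) and first_iloc.equals(first_loc))
print(last_tail.equals(last_iloc) and last_iloc.equals(last_loc))
```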

### Select several intermediate rows of the `DataFrame`

View the `head` of `df_co2` to see the second through fourth rows that the selections below should return.

##### $\rightarrow$ Use implicit indexing to select the second through fourth rows

##### $\rightarrow$ Use explicit indexing to select the second through fourth rows
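For intermediate rows, the off-by-one difference between the two indexers is the main thing to watch. The sketch below uses hypothetical toy data with the default integer index, where the second through fourth rows carry the labels 1, 2, and 3.

```python
import pandas as pd

# Hypothetical toy data with the default RangeIndex 0..4
toy = pd.DataFrame({"year": [2000, 2001, 2002, 2003, 2004],
                    "co2": [1.0, 1.1, 1.2, 1.3, 1.4]})

middle_iloc = toy.iloc[1:4]  # positions 1, 2, 3 (stop position 4 is excluded)
middle_loc = toy.loc[1:3]    # labels 1, 2, 3 (stop label 3 is included)

print(middle_iloc.equals(middle_loc))
```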

### Select rows using logical indexing

##### $\rightarrow$ Select the rows corresponding to Canada and the United States of `df_co2`

Examine the countries present in the data set

In [None]:
df_co2['country'].unique()

Create a list of the values of the `country` column that are not individual countries

In [None]:
omit_locations = ['Africa', 'Asia', 'Asia (excl. China & India)','EU-27',
                  'Europe', 'Europe (excl. EU-27)',
                  'Europe (excl. EU-28)', 'European Union (27)', 
                  'European Union (28)', 'French Equatorial Africa', 
                  'French Guiana', 'French Polynesia', 'French West Africa',
                  'High-income countries', 'International transport', 
                  'Low-income countries', 'Lower-middle-income countries',
                  'North America',  'North America (excl. USA)', 'Oceania', 
                  'Panama Canal Zone','South America', 'Upper-middle-income countries', 
                  'World']

##### $\rightarrow$ Select the rows of `df_co2` corresponding to individual countries (those whose `country` value is not in `omit_locations`) using the `.loc` indexer
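One common way to build the Boolean masks for these selections is `Series.isin`, negated with `~` when rows should be excluded. The sketch below is an assumption about approach, not the required solution; `toy`, `omit`, and the emission values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data mixing countries and aggregate regions (made-up values)
toy = pd.DataFrame({
    "country": ["Canada", "United States", "World", "Europe", "Mexico"],
    "co2": [1.0, 2.0, 10.0, 5.0, 0.5],
})

# Keep rows whose country is in a given list
pair = toy.loc[toy["country"].isin(["Canada", "United States"])]

# Keep rows whose country is NOT in a list of aggregates
omit = ["World", "Europe"]
countries_only = toy.loc[~toy["country"].isin(omit)]

print(pair)
print(countries_only)
```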

# Video 3: Sorting

##### $\rightarrow$ Select the `country` and `co2_per_capita` columns for the year 1964, keeping only the `country` values that are not groups of countries. Name the result `df_1964`.

## Use numerical summaries to find min and max

### Using the `.describe()` method

##### $\rightarrow$ Use the `.describe()` method to find the minimum and maximum CO$_2$ emissions per capita in 1964.

### Using aggregation

##### $\rightarrow$ Use the `.agg()` method to find the minimum and maximum CO$_2$ emissions per capita in 1964.

The min and max are computed for each column independently. For the `country` column they are simply the alphabetically first and last names, so Afghanistan and Zimbabwe are not the min and max emitters.
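The column-by-column behavior of `.agg()` is easy to see on toy data. In this sketch the values are made up, and `idxmax` is shown as one possible way to recover the row that actually holds the largest value; it is an illustration, not the method the assignment asks for.

```python
import pandas as pd

# Hypothetical toy data (made-up values, not real 1964 emissions)
toy = pd.DataFrame({
    "country": ["Zimbabwe", "Afghanistan", "Qatar"],
    "co2_per_capita": [0.5, 0.1, 40.0],
})

# min and max are taken separately for each column
summary = toy.agg(["min", "max"])
print(summary)

# The alphabetical max ("Zimbabwe") is not the row with the largest value
top_row_label = toy["co2_per_capita"].idxmax()
print(toy.loc[top_row_label, "country"])
```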

## Sort values to find highest emitters

##### $\rightarrow$ Sort `df_1964` in descending order according to `co2_per_capita` to find the 10 highest CO$_2$ emissions per capita in 1964.

## Sort values to find lowest emitters

##### $\rightarrow$ Sort `df_1964` in ascending order according to `co2_per_capita` to find the 10 lowest CO$_2$ emissions per capita in 1964.

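We do not use the tail of the descending sort because `sort_values` places `NaN`s last by default, so the tail would contain missing values rather than the lowest emitters.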
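The effect of missing values on sorting can be checked with a small example. The data below are hypothetical; the sketch relies only on the default `na_position="last"` of `sort_values`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "co2_per_capita": [2.0, np.nan, 0.5, 7.0],
})

descending = toy.sort_values("co2_per_capita", ascending=False)
ascending = toy.sort_values("co2_per_capita", ascending=True)

# NaN rows land at the end in both directions
print(descending["country"].tolist())
print(ascending["country"].tolist())
```

Because the `NaN` row is last in both orderings, the head of the ascending sort gives the lowest emitters, while the tail of the descending sort gives the missing values.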

# Video 4: Long/Tidy and Wide data formats

![](https://d33wubrfki0l68.cloudfront.net/6f1ddb544fc5c69a2478e444ab8112fb0eea23f8/91adc/images/tidy-1.png)

https://r4ds.had.co.nz/tidy-data.html

##### $\rightarrow$ View the `head` of the `DataFrame` `df` to determine whether it is in tidy (long) or wide format.

## Import and examine the CO$_2$ emissions data set from the World Bank

We will now examine a different CO$_2$ emissions data set from the World Bank. This data set has estimates of CO$_2$ emissions (in metric tons per person) from 1960 to 2018.

You can read more about this data set at https://data.worldbank.org/indicator/EN.ATM.CO2E.PC

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/brian-fischer/DATA-5100/main/CO2Emissions1960to2018.csv')

##### $\rightarrow$ View the `head` of the `DataFrame` `df` to determine whether it is in tidy (long) or wide format.

## Convert from wide to tidy format

##### $\rightarrow$ Use the Pandas `melt` function to convert this `DataFrame` from wide to tidy format by creating one column of the year and one column of CO$_2$ emissions. Name the `DataFrame` `df_tidy`.
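A minimal sketch of `pd.melt` on a toy wide table is shown below. The identifier column name `Country Name` and all values are assumptions for illustration; check the actual column names in the World Bank file with `.head()` before melting.

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per year
wide = pd.DataFrame({
    "Country Name": ["Freedonia", "Sylvania"],
    "1960": [1.0, 2.0],
    "1961": [1.1, 2.1],
})

# id_vars stay as columns; every other column becomes a (year, co2) pair
tidy = pd.melt(wide, id_vars="Country Name", var_name="year", value_name="co2")
print(tidy)
```

The former column headers become the values of `year`, so they are strings at this point rather than integers.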

##### $\rightarrow$ Use the `.info()` method to determine the data types of the `year` and `co2` columns. Are they the desired types?

### Convert data types if necessary

##### $\rightarrow$ Convert the `year` column to an `int` data type
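One way to do the conversion is `Series.astype`, sketched here on hypothetical toy data where the years are stored as strings.

```python
import pandas as pd

# Hypothetical tidy data with years stored as strings
tidy = pd.DataFrame({"year": ["1960", "1961"], "co2": [1.0, 1.1]})

# astype returns a new Series, so assign it back to the column
tidy["year"] = tidy["year"].astype(int)
print(tidy.dtypes)
```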

## Operations with wide and tidy formats

### Numerical summaries

Compute descriptive summary statistics for CO$_2$ emissions each year.

**Wide Format**

##### $\rightarrow$ Use the `.describe()` method to compute summary statistics for CO$_2$ emissions in each year for the `DataFrame` in wide format.

**Tidy Format**

##### $\rightarrow$ Use the `.describe()` method to compute summary statistics for CO$_2$ emissions in each year for the `DataFrame` in tidy format.
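The two formats call for slightly different code to get the same per-year statistics. In the sketch below, the wide table is summarized directly, while the tidy table is grouped by `year` first; using `groupby` here is an assumption about approach, and all names and values are hypothetical.

```python
import pandas as pd

# Hypothetical wide table and its tidy equivalent
wide = pd.DataFrame({
    "country": ["Freedonia", "Sylvania", "Osterlich"],
    "1960": [1.0, 2.0, 3.0],
    "1961": [1.5, 2.5, 3.5],
})
tidy = wide.melt(id_vars="country", var_name="year", value_name="co2")

# Wide: each year is its own numeric column, so describe() handles them all
wide_stats = wide.describe()

# Tidy: group the single co2 column by year, then describe each group
tidy_stats = tidy.groupby("year")["co2"].describe()

print(wide_stats)
print(tidy_stats)
```

The two results hold the same numbers with rows and columns swapped: years are columns in `wide_stats` and rows in `tidy_stats`.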

### Plots

#### Comparative boxplots

##### $\rightarrow$ Produce comparative boxplots for CO$_2$ emissions in each year for the wide and tidy `DataFrames`.
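As a rough sketch, comparative boxplots can be drawn from either format with the plotting methods built into Pandas (a substitution for illustration; `seaborn` would work as well). The data and names below are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical wide table and its tidy equivalent
wide = pd.DataFrame({
    "country": ["Freedonia", "Sylvania", "Osterlich"],
    "1960": [1.0, 2.0, 3.0],
    "1961": [1.5, 2.5, 3.5],
})
tidy = wide.melt(id_vars="country", var_name="year", value_name="co2")

fig, (ax_wide, ax_tidy) = plt.subplots(1, 2, figsize=(8, 3))

# Wide: one box per selected year column
wide.boxplot(column=["1960", "1961"], ax=ax_wide)

# Tidy: one box per group of the year column
tidy.boxplot(column="co2", by="year", ax=ax_tidy)
```

In a notebook the figure is displayed automatically at the end of the cell.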

#### Line plot

**Tidy format**

##### $\rightarrow$ Produce line plots of CO$_2$ emissions by year for Saudi Arabia and the United States for the `DataFrame` in tidy format.
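With tidy data, a line plot per country comes down to filtering the rows and drawing one line per group. The sketch below uses plain `matplotlib` and hypothetical countries and values; it is one possible approach, not the required one.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical tidy data for three made-up countries
tidy = pd.DataFrame({
    "country": ["Freedonia"] * 3 + ["Sylvania"] * 3 + ["Osterlich"] * 3,
    "year": [1960, 1961, 1962] * 3,
    "co2": [1.0, 1.2, 1.1, 2.0, 2.4, 2.3, 0.5, 0.6, 0.7],
})

# Keep only the two countries of interest
selected = tidy.loc[tidy["country"].isin(["Freedonia", "Sylvania"])]

fig, ax = plt.subplots()
for name, group in selected.groupby("country"):
    ax.plot(group["year"], group["co2"], label=name)  # one line per country

ax.set_xlabel("year")
ax.set_ylabel("co2")
ax.legend()
```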