## `Imports`
As per usual, we start by importing the packages that we will be using later. It's generally a good practice to do so at the top of a file. 

    If you have trouble importing any of these packages, make sure they are properly installed (see the README).

In [1]:
import os
import sys
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

# ============ Pandas  ============

Pandas is a library for data manipulation and analysis. There are two fundamental data structures in pandas: the **Series** and **DataFrame** structures which are built on top of NumPy arrays.

The following introduction is largely based on this [tutorial](http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/). Another useful reference is the [Pandas introduction to data structures](http://pandas.pydata.org/pandas-docs/stable/dsintro.html). Pandas is well documented and you will find good information about all methods and structures in the [API reference](http://pandas.pydata.org/pandas-docs/stable/api.html).

### Series

A **Series** is a one-dimensional object (similar to a vector). Each element has a corresponding *index*. By default the indices range from 0 to N-1, where N is the length of the Series.

In [None]:
# Let's create a Series by passing in a list without specifying the indices.
s = pd.Series([1, 4.2, 7.6, 16.2])
s

In [None]:
# Now, let's specify the indices explicitly
s = pd.Series([1, 4.2, 7.6, 16.2], index=['A', 'B', 'C', 'D'])
s

In [None]:
# Indexing the Series
s['B']

In [None]:
# We can also index by using boolean logic
s[s > 2]
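To make the boolean indexing step explicit, here is a minimal sketch that separates the comparison from the selection. The names `mask` and `selected` are introduced here for illustration only.

```python
import pandas as pd

s = pd.Series([1, 4.2, 7.6, 16.2], index=['A', 'B', 'C', 'D'])

# The comparison is evaluated element-wise and returns a boolean Series
mask = s > 2
print(mask)

# Passing the mask back to the Series keeps only the entries marked True
selected = s[mask]
print(selected)
```

The one-liner `s[s > 2]` performs both steps at once.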

### DataFrame

A DataFrame is a tabular data structure comprised of rows and columns. You can also think of the DataFrame as a collection of Series objects that share an index. 
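The "collection of Series" view can be sketched as follows. The labels `p1`, `p2` and the values are hypothetical and only serve the illustration.

```python
import pandas as pd

# Two Series that share the same index labels (hypothetical example values)
age = pd.Series([23, 27], index=['p1', 'p2'])
height = pd.Series([180, 167], index=['p1', 'p2'])

# Each dict key becomes a column name; rows are aligned on the shared index
people = pd.DataFrame({'Age': age, 'Height': height})
print(people)

# Selecting a single column gives back a Series with that same index
print(type(people['Age']))
```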

#### Creating DataFrame structures

We can create an empty DataFrame by specifying the column names. Then we can insert data row by row.

In [None]:
import pandas as pd
df = pd.DataFrame(columns=['Gender', 'Age', 'Height', 'Weight'])
df

In [None]:
# Now let's add an observation
df.loc[0] = ['Male', 23, 180, 73]  # Note how we used .loc to specify the index
df.loc['A'] = ['Female', 27, 167, 59]
df

You can populate using a dictionary too which allows you to do things in a nonstandard order...

In [None]:
df.loc['i'] = dict(Weight='3kgs', Age=10, Gender='Blue', Height=-12)
df

#### Creating DataFrame from other structures

You can also create a dataframe from:
* Dict of 1D ndarrays, lists, dicts, or Series
* 2-D numpy.ndarray
* Structured or record ndarray
* A Series
* Another DataFrame

For example:

In [None]:
# Create a DataFrame from a list
some_list = [['Male', 23, 180, 73], ['Female', 27, 167, 59]]
df = pd.DataFrame(some_list, index=[0, 'A'], columns=['Gender', 'Age', 'Height', 'Weight'])
df

In [None]:
# Create a DataFrame from a dictionary where keys are column names
column_key_dict = {
    'Gender': ['Male', 'Female'],
    'Age': [23, 27],
    'Height': [180, 167],
    'Weight': [73, 59]
}
df = pd.DataFrame.from_dict(column_key_dict, orient='columns')
df.index = [0, 'A']
df

In [None]:
# Create a DataFrame from a dictionary where keys are index values
index_key_dict = {0:['Male', 23, 180, 73], 'A':['Female', 27, 167, 59]}
df = pd.DataFrame.from_dict(index_key_dict, orient='index')
df.columns = ['Gender', 'Age', 'Height', 'Weight']
df

In [None]:
# Using the DataFrame call, keys are assumed to be column headers
df = pd.DataFrame({0:['Male', 23, 180, 73], 'A':['Female', 27, 167, 59]}, 
                   index=['Gender', 'Age', 'Height', 'Weight'])
df

In [None]:
# ...we can transpose using the `.T` attribute

In [None]:
df = df.T
df
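The list of accepted inputs earlier in this section also mentions 2-D NumPy arrays and Series, which the cells above do not cover. A minimal sketch of both, reusing the same example values (the variable names are chosen here for illustration):

```python
import numpy as np
import pandas as pd

# From a 2-D NumPy array: index and column names are supplied separately
arr = np.array([[23, 180, 73], [27, 167, 59]])
df_from_array = pd.DataFrame(arr, index=[0, 'A'], columns=['Age', 'Height', 'Weight'])
print(df_from_array)

# From a single Series: the Series becomes the only column,
# and its name is used as the column header
ages = pd.Series([23, 27], index=[0, 'A'], name='Age')
df_from_series = pd.DataFrame(ages)
print(df_from_series)
```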

#### Loading a CSV into a DataFrame

Most commonly we create DataFrame structures by reading CSV files. To run the following piece of code you need to download the datasets associated with the course and place them in a subdirectory called "data" under the same directory where your notebooks are located. Alternatively, you can specify the full path of the .csv file.

In [None]:
cpu_loc = os.path.join(os.getcwd(), 'data', 'cpu.csv')
cpu_loc

In [None]:
cpu = pd.read_csv(cpu_loc)
cpu.head() # Head shows the first few elements (unless specified otherwise) of the DataFrame

You should see that each observation in our dataset comprises 8 measurements (attributes).
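If the course files are not at hand, the behaviour of `read_csv` can still be tried out on an in-memory string. This is only a sketch: the CSV content and column names below are made up and are not the `cpu` dataset.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """name,speed,cores
alpha,2.4,4
beta,3.1,8
gamma,1.8,2
"""

# read_csv accepts a path or a file-like object; the first row is used as the header
demo = pd.read_csv(io.StringIO(csv_text))
print(demo.head(2))   # first two rows only
print(demo.shape)     # (number of rows, number of columns)
```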

#### Basic methods for DataFrame objects
* `head(N)`: displays the first N rows of the DataFrame (5 by default)
* `tail(N)`: displays the last N rows of the DataFrame (5 by default)
* `info()`:  displays basic information about the variables
* `describe()`: displays summary statistics of the data

Execute the following cells and observe the outputs.

In [None]:
cpu.tail(5)

In [None]:
cpu.info()

In [None]:
cpu.describe()

#### Column Selection

You can think of a DataFrame as a group of Series that share an index (in this case the column headers). This makes it easy to select specific **columns**.

In [None]:
cpu['MMAX'].head(5)

In [None]:
type(cpu['MMAX'])

To select multiple columns we simply need to pass a list of column names. The resulting object is another DataFrame.

In [None]:
cpu[['MMIN', 'MMAX']].head(7)

In [None]:
type(cpu[['MMIN', 'MMAX']].head(7)) # This is a DataFrame
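The difference between passing a name and passing a list of names can be checked on a small, made-up frame (the values below are hypothetical; only the column names mirror the ones used above):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'MMIN': [256, 8000, 512], 'MMAX': [6000, 32000, 8000]})

single = toy['MMAX']        # a column name    -> Series
as_frame = toy[['MMAX']]    # a list of names  -> DataFrame, even with one entry

print(type(single).__name__, single.shape)
print(type(as_frame).__name__, as_frame.shape)
```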

#### Row selection

To select specific **observations (i.e. rows)** we need to pass in the corresponding indices. This operation is called *slicing*. The resulting structure is again a DataFrame.

In [None]:
cpu[0:3]

In [None]:
# This is equivalent to using .iloc
cpu.iloc[0:3]

#### Filtering

Now suppose that you want to select all the observations with an MMAX value higher than 35000. This is easy to do:

In [None]:
cpu[cpu['MMAX'] > 35000]

Or equivalently:

In [None]:
cpu[cpu.MMAX > 35000]

You can also filter the data by using multiple attributes:

In [None]:
cpu[(cpu.MMAX > 35000) & (cpu.MMIN > 16000)]
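Conditions can be combined in other ways as well. The sketch below uses a small hypothetical frame (values invented for illustration) to show element-wise AND, OR and negation.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'MMIN': [256, 8000, 32000, 16000],
                    'MMAX': [6000, 32000, 64000, 64000]})

# & is element-wise AND, | is element-wise OR, ~ negates a mask.
# Each comparison needs its own parentheses because & and | bind
# more tightly than > or < in Python.
both = toy[(toy.MMAX > 35000) & (toy.MMIN > 16000)]
either = toy[(toy.MMAX > 35000) | (toy.MMIN > 7000)]
neither = toy[~((toy.MMAX > 35000) | (toy.MMIN > 7000))]

print(both.index.tolist(), either.index.tolist(), neither.index.tolist())
```

Note that the Python keywords `and`, `or` and `not` do not work element-wise on Series; the operators above are needed instead.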

We saw before how we can select rows by passing the index numbers. This works most of the time, but very often our indices are not in linear ascending order.

There are two basic methods of indexing DataFrame structures:
* `loc`: works on labels in the index
* `iloc`: works on the position in the index (so it only takes integers)

The following example should clarify the difference between label-based indexing (`loc`) and positional indexing (`iloc`).


In [None]:
# First let's create a new dataframe
cpu_new = cpu[cpu['MMAX'] > 35000]
cpu_new

In [None]:
cpu_new.loc[8:10] # Selects the rows labelled 8 to 10 (unlike iloc, a loc slice includes the end label)

In [None]:
cpu_new.iloc[0:2] # Look for the first and second rows (this yields the same result as before)

In [None]:
# If we try the following we will get an empty DataFrame because there are no rows with labels between 0 and 2.
cpu_new.loc[0:2]

In [None]:
# The result is another DataFrame
type(cpu[0:2])
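The same comparison can be reproduced without the dataset. The frame below is hypothetical; its index labels were picked so that they do not start at 0.

```python
import pandas as pd

# Hypothetical frame whose index labels do not start at 0
toy = pd.DataFrame({'MMAX': [64000, 64000, 40000, 48000]}, index=[8, 9, 10, 31])

by_label = toy.loc[8:10]       # labels 8, 9 and 10: the end label is included
by_position = toy.iloc[0:2]    # positions 0 and 1: the end position is excluded
no_match = toy.loc[0:2]        # no labels fall between 0 and 2, so the result is empty

print(by_label.index.tolist())
print(by_position.index.tolist())
print(no_match.empty)
```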

A very common scenario is the following: we want to select specific observations and columns of a DataFrame and convert them to a NumPy array so that we can use it for feature extraction, classification, etc. This can be achieved with the `values` attribute.

In [None]:
# Select the first 10 observations and the "MMIN" and "MMAX" columns only and convert to numpy array.
cpu[:10][['MMIN', 'MMAX']].values

You can confirm that the object returned by the `values` attribute is a NumPy array, for example by passing it to `type()`.
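A minimal sketch of that check, on made-up data. It also shows `to_numpy()`, the method form that the pandas documentation recommends over `.values` in recent versions; whether your installed version provides it is an assumption (it was added in pandas 0.24).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'MMIN': [256, 8000, 512],
                    'MMAX': [6000, 32000, 8000],
                    'ERP': [199, 253, 132]})

# Select the first two rows and two columns, then convert
subset = toy.iloc[:2][['MMIN', 'MMAX']]
arr = subset.values          # attribute access, no parentheses
arr2 = subset.to_numpy()     # equivalent method form

print(type(arr), arr.shape)
```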

#### Indexing - selecting rows and columns

*WARNING* - indexing is probably the most difficult part of pandas to get used to. If you get stuck [refer to the documentation on indexing](https://pandas.pydata.org/pandas-docs/stable/indexing.html).

Summary of DataFrame methods for indexing:
* iloc - ignore index labels, index like numpy with integer positions
* loc - use index labels

To illustrate, observe what happens when we reorder the rows of our dataframe.

In [None]:
cpu.sort_values('ERP', inplace=True)

In [None]:
cpu.iloc[:10]

In [None]:
cpu.loc[:10]
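The effect of sorting is easier to see on a tiny frame. The sketch below uses hypothetical `ERP` values; the index labels travel with their rows when the frame is sorted.

```python
import pandas as pd

# Hypothetical toy data with the default index 0..3
toy = pd.DataFrame({'ERP': [30, 10, 20, 5]})
toy = toy.sort_values('ERP')     # row order is now labels 3, 1, 2, 0

first_two = toy.iloc[:2]         # first two rows by position
up_to_label_2 = toy.loc[:2]      # every row from the top down to label 2, inclusive

print(first_two.index.tolist())
print(up_to_label_2.index.tolist())
```

After sorting, `iloc[:2]` and `loc[:2]` no longer select the same rows, because position and label have come apart.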

Observe what happens if we change the label of the row that now comes first in the index.

In [None]:
cpu = cpu.rename(index={cpu.index[0]: 'A'})

In [None]:
cpu.iloc[:10]

In [None]:
try:
    cpu.loc[:10]
except TypeError as e:
    print(e)

# Exercises

#### ========== Question 1 ==========
Load the `credit` dataset and display basic information about it.

In [1]:
# Your code goes here

#### ========== Question 2 ==========
Display the summary statistics of the attributes of the dataset.

In [2]:
# Your code goes here

#### ========== Question 3 ==========
Display the last 6 instances of the dataset.

In [3]:
# Your code goes here

#### ========== Question 4 ==========
Print the 5th observation.

In [4]:
# Your code goes here

#### ========== Question 5 ==========
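Print the standard deviation of the attribute `CreditAmount` by using the NumPy method `std`. You can check your result against the statistics table from Question 2; expect a small difference, because `describe()` reports the sample standard deviation (`ddof=1`) while `np.std` defaults to the population version (`ddof=0`).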

In [5]:
# Your code goes here