# NumPy and Pandas

In this session, we'll look at:

* Basic Python datatypes
* NumPy
    * Creating NumPy arrays
    * Modifying arrays
    * Indexing arrays
    * 5-number summaries with NumPy arrays
* Pandas
    * Reading in data
    * Dataframe summaries
    * Accessing and interacting with our dataframe

## Basic Python Datatypes

We'll start off by having a simple recap of the basic Python datatypes:

* Integers
* Floats
* Strings
* Booleans


### Integers

Integers (or *int*) are a way of storing numerical values in Python. Integers are *whole numbers* (positive, negative, or zero) with no decimal point.
 



### Floats

Floats are another way to store numerical values. These are positive or negative *real numbers* that are written with a decimal point.


As you might have seen before, we can use the function `type` to find out what data type something is:

In [6]:
type(5)

int

In [2]:
type(1983427)

int

In [3]:
type(5.2492734)

float

In [4]:
type(1983427.0)

float

### Strings

Strings are ordered sequences of characters. Strings are contained within single quotation marks, `' '`, or double quotation marks, `" "`.

In [7]:
str1 = "My name is Claudia"
print(str1)

My name is Claudia


In [8]:
str2 = 'My name is Claudia'
print(str2)

My name is Claudia


Strings are identified by the type `str`:

In [9]:
type(str1)

str

### Booleans

Booleans are a data type that can only take the values `True` or `False` (which behave like 1 and 0). Booleans work with **comparison operators** (equal to, greater than, less than...) and **logical operators** (`and`, `or`, `not`). An expression that uses one of these operators evaluates to `True` if the condition is met, and `False` if it is not. Let's see some examples:

**Comparison operators**

In [10]:
3 > 5

False

In [11]:
3 < 5

True

In [12]:
9 == 9

True

In [13]:
4 > 12

False

In [14]:
4 != 12

True

**Logical operators**

In [16]:
12 > 5 and 4 > 9

False

In [19]:
3 == 4 or 4 < 1

False

In [20]:
True and False

False

In [21]:
True or False

True

In [22]:
False and False

False
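The examples above do not use `not`, and the note that Booleans correspond to 1 and 0 can be checked directly. The following is a small sketch; the variable names `is_greater` and `combined` are only illustrative.

```python
# Booleans behave like the integers 1 and 0 in comparisons and arithmetic.
print(True == 1)        # True
print(False == 0)       # True
print(True + True)      # 2

# `not` flips a Boolean value.
is_greater = 3 > 5
print(not is_greater)   # True

# Operators can be combined; parentheses make the grouping explicit.
combined = (12 > 5) and not (4 > 9)
print(combined)         # True
```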

## NumPy

Now, we will look at the Python package NumPy (NUMerical PYthon!). NumPy is a core Python package for scientific computing and for applying mathematical functions to data. With NumPy we can store our data in NumPy arrays and then use these functions on them.

### Numpy Arrays

Numpy arrays are multidimensional data structures that can store values of the *same data type* (i.e., only floats, only ints, etc...). This is different to Python **lists**, which can store different data types in one list.

NumPy arrays provide some advantages over Python lists:
* They use less memory
* They provide faster performance
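To illustrate the "same data type" rule and the difference from lists, here is a minimal sketch. The names `mixed_list`, `mixed_array` and `doubled` are made up for this example.

```python
import numpy as np

# A list can mix types freely.
mixed_list = [1, 2.5, 3]

# An array stores a single data type, so the integers are converted to floats.
mixed_array = np.array(mixed_list)
print(mixed_array.dtype)    # float64

# Arithmetic applies to every element at once, with no explicit loop.
doubled = mixed_array * 2
print(doubled.tolist())     # [2.0, 5.0, 6.0]

# The same expression on a list repeats the list instead.
print(mixed_list * 2)       # [1, 2.5, 3, 1, 2.5, 3]
```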

### Creating a Numpy Array

First, we need to import the NumPy package (it must already be installed in your environment):

In [23]:
import numpy as np

We can create a NumPy array by passing a list, or a list of lists (depending on the dimensions of your data), to `np.array` and assigning the result to a variable:

In [24]:
my1darray = np.array([1, 2, 3, 4, 5])

my1darray

array([1, 2, 3, 4, 5])

In [25]:
my2darray = np.array([[1, 2, 3], [4, 5, 6]])

my2darray

array([[1, 2, 3],
       [4, 5, 6]])

In [34]:
my3darray = np.array([[[1,2,1], [2,1,3]], [[3,4,5], [4, 3,6]], [[5,6,6], [6, 5,2]]])

my3darray

array([[[1, 2, 1],
        [2, 1, 3]],

       [[3, 4, 5],
        [4, 3, 6]],

       [[5, 6, 6],
        [6, 5, 2]]])

We know we can create arrays with different dimensions. If the number of dimensions isn't clear from looking at the array, we can check it using `np.ndim()`:

In [28]:
np.ndim(my2darray)

2

Another useful function, especially for larger arrays, is `np.shape()`. This shows the number of elements along each dimension of the array. For example:

In [29]:
np.shape(my2darray)

(2, 3)

In [32]:
np.shape(my3darray)

(3, 2, 3)
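The same information is also available as attributes on the array itself. A short sketch, using an illustrative array named `example`:

```python
import numpy as np

example = np.array([[1, 2, 3], [4, 5, 6]])

print(example.ndim)    # 2  (number of dimensions)
print(example.shape)   # (2, 3)  (rows, columns)
print(example.size)    # 6  (total number of elements)
print(example.dtype)   # the integer type, which depends on your platform
```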

### Modifying arrays

There are two main ways to modify an array: **adding** elements, or **deleting** elements.


#### Adding elements

We add new elements to an array using the `np.append()` function, which returns a new array and leaves the original unchanged. We can add an entire row or column by specifying the axis along which we want to append: `axis = 0` adds a row, while `axis = 1` adds a column.

We use the general formula `np.append(array, values, axis)`

In [35]:
myarray = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]])

print(np.shape(myarray))
myarray

(3, 5)


array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15]])

In [36]:
myarray_extrarow = np.append(myarray, [[16, 17, 18, 19, 20]], axis = 0)

print(np.shape(myarray_extrarow))
myarray_extrarow

(4, 5)


array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20]])

In [37]:
myarray_extracol = np.append(myarray, [[1], [2], [3]], axis = 1)

print(np.shape(myarray_extracol))
myarray_extracol

(3, 6)


array([[ 1,  2,  3,  4,  5,  1],
       [ 6,  7,  8,  9, 10,  2],
       [11, 12, 13, 14, 15,  3]])

Note that when adding on a column, each element has to be in a list of its own.
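Two details of `np.append()` are easy to miss: it does not change the original array, and leaving out `axis` flattens everything. The sketch below uses a small illustrative array named `small`.

```python
import numpy as np

small = np.array([[1, 2], [3, 4]])

# np.append returns a new array; `small` keeps its original shape.
with_row = np.append(small, [[5, 6]], axis=0)
print(small.shape)          # (2, 2)
print(with_row.shape)       # (3, 2)

# Without `axis`, both inputs are flattened to one dimension first.
flattened = np.append(small, [[5, 6]])
print(flattened.tolist())   # [1, 2, 3, 4, 5, 6]
```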

#### Deleting elements

We delete elements from an array using the `np.delete()` function. When deleting elements, we have to specify the **index** at which we want to remove elements.

In [38]:
myarray

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15]])

In [39]:
myarray_removerow = np.delete(myarray, [0], axis = 0)

myarray_removerow

array([[ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15]])

We can see that we have removed the first row (located at index 0). Let's try it with the first column by simply changing the axis argument:

In [40]:
myarray_removecol = np.delete(myarray, [0], axis = 1)

myarray_removecol

array([[ 2,  3,  4,  5],
       [ 7,  8,  9, 10],
       [12, 13, 14, 15]])

Success!
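`np.delete()` also accepts several indices at once. As a minimal sketch, with an illustrative array named `grid`:

```python
import numpy as np

grid = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

# Remove the columns at index 0 and index 2 in one call.
trimmed = np.delete(grid, [0, 2], axis=1)
print(trimmed.tolist())   # [[2, 4], [6, 8]]

# np.delete returns a new array, so `grid` is unchanged.
print(grid.shape)         # (2, 4)
```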

## <font color = 'lightblue'> Exercise </font>

Using `myarray`, add a new **column** to the array, containing the values `15`, `40` and `65`, and save this new array as `array1`.

Then, delete the last **row** from `array1` and save it as `array2`.

In [None]:
array1 = ...


array2 = ...

### Indexing arrays

We can also locate specific elements in arrays, using slicing and indexes. We specify the index of the element using `[row index, column index]` - remembering to use square brackets. Let's look for the very first element, which is in row index 0, column index 0:

In [41]:
myarray

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15]])

In [42]:
myarray[0,0]

1

What about in the third column, second row? *(Remember, indexing starts from 0!)*

In [43]:
myarray[1,2]

8

We can also select a subset of rows or columns. We use `:` to indicate a selection of rows/columns - for example `[2:5]` means indices two to **four** - the last index we mention is not included in the subset.

In [44]:
myarray

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15]])

In [45]:
myarray[0:2]

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10]])

In [46]:
myarray[0:3, 2:5]

array([[ 3,  4,  5],
       [ 8,  9, 10],
       [13, 14, 15]])

We can also use `:` alone to specify 'all rows' or 'all columns'. Notice in the last example, `0:3` was specifying 'all rows'. Let's try replacing that with simply `:` and see if we get the same result:

In [47]:
myarray[:, 2:5]

array([[ 3,  4,  5],
       [ 8,  9, 10],
       [13, 14, 15]])

Success! We can also use it to index all columns:

In [48]:
myarray[0:2, :]

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10]])

Another way we can index arrays is by using **conditions**. Using conditions, we can return only the values that meet particular criteria.

If we wanted to return only the values in the array that are greater than 7:

In [49]:
myarray[myarray > 7]

array([ 8,  9, 10, 11, 12, 13, 14, 15])

In this code we are essentially saying "return the values of myarray where the values of myarray are greater than 7". 

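What if we wanted to return only even numbers? The modulo operator `%` gives the remainder of a division, so an even number divided by 2 leaves a remainder of 0: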

In [50]:
4%2

0

In [51]:
7%2

1

In [None]:
100%2

In [None]:
105%2

In [52]:
myarray[myarray % 2 == 0]

array([ 2,  4,  6,  8, 10, 12, 14])
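It can help to look at the condition on its own, and to see how two conditions are combined. This sketch uses an illustrative array named `values`; with NumPy arrays, conditions are combined with `&` and `|` rather than `and` and `or`.

```python
import numpy as np

values = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# The condition on its own produces an array of Booleans (a "mask").
mask = values > 4
print(mask.tolist())
# [[False, False, False], [False, True, True], [True, True, True]]

# Combine conditions with & (and) and | (or); each condition needs parentheses.
even_and_large = values[(values > 4) & (values % 2 == 0)]
print(even_and_large.tolist())   # [6, 8]

small_or_large = values[(values < 2) | (values > 8)]
print(small_or_large.tolist())   # [1, 9]
```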

## <font color = 'lightblue'> Exercises </font>

Select the **last two rows** and **first three columns** of `myarray`

In [None]:
myarray[...]

How would you index the **bottom right** element in an array? Try it with `myarray`.

In [None]:
myarray[...]

### 5-number summary Numpy functions

One of the most useful features of NumPy is its built-in summary functions, which quickly give basic summary statistics about your data. These functions include:

* minimum; `np.min()`
* maximum; `np.max()`
* mean; `np.mean()`
* median; `np.median()`
* standard deviation; `np.std()`

Some other useful functions that give summaries of your data are:
* sum; `np.sum()`
* variance; `np.var()`

In [53]:
np.min(myarray)

1

In [54]:
np.max(myarray)

15

In [55]:
np.mean(myarray)

8.0

In [56]:
np.median(myarray)

8.0

In [57]:
np.std(myarray)

4.320493798938574

In [58]:
np.var(myarray)

18.666666666666668

In [59]:
np.sum(myarray)

120
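Strictly speaking, a five-number summary consists of the minimum, lower quartile, median, upper quartile and maximum. As a minimal sketch, these can be computed in one call with `np.percentile`. The array `scores` below is an illustrative copy of the values in `myarray`. The summary functions also accept an `axis` argument to summarise per column or per row.

```python
import numpy as np

scores = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]])

# Minimum, lower quartile, median, upper quartile and maximum in one call.
five_numbers = np.percentile(scores, [0, 25, 50, 75, 100])
print(five_numbers.tolist())             # [1.0, 4.5, 8.0, 11.5, 15.0]

# axis=0 summarises down each column, axis=1 across each row.
column_means = np.mean(scores, axis=0)
row_means = np.mean(scores, axis=1)
print(column_means.tolist())             # [6.0, 7.0, 8.0, 9.0, 10.0]
print(row_means.tolist())                # [3.0, 8.0, 13.0]
```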

## <font color = 'lightblue'> Exercises </font>

Find the **sum** of a subset of `myarray` containing only the **first two rows** and **first two columns** of the array

Show that the **standard deviation** of `myarray` is equal to the **square root of the variance** of `myarray`

## Pandas

In this section we will look at Pandas dataframes, the data structure most commonly used for data analysis in Python. We will define what a Pandas dataframe is, show how to create dataframes, and show how to access them.

### Importance of Pandas in the World of Data Science

Pandas has become an essential tool in the field of data science because it allows users to easily *read, clean, transform, and manipulate* data from a variety of sources such as CSV, Excel, SQL databases, and more. With Pandas, users can perform complex operations on large datasets efficiently and easily. The operations in Pandas can also serve as a pre-processing step for **machine learning**.

Pandas can also extend into the following industry applications:

<div style="text-align:center;">
    <img src="https://i.imgur.com/887daLJ.jpg " alt="Image Description" width="500" height="400">
</div>

Image cred: [Starship Knowledge](https://starship-knowledge.com/numpy-vs-pandas)

### What is a Pandas Dataframe?
Pandas is a data manipulation tool which is built on the Numpy package. Pandas' key data structure is the `dataframe`. A dataframe allows for the storage and manipulation of tabular data. It is a two-dimensional labelled data structure. 

<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Pandas_dataframe.jpg" width="900">

Basically, you could say that the Pandas dataframe consists of three main components: 

- the data, 
- the index and 
- the columns. 


---

### Different ways of creating a dataframe

- Lists
- Dictionaries
- NumPy arrays
- CSV files
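The first three options can be sketched as follows. The names and values here are invented for illustration and are not part of the employee dataset.

```python
import numpy as np
import pandas as pd

# From a list of rows, with column names supplied separately.
df_from_list = pd.DataFrame([['Ann', 31], ['Ben', 27]], columns=['Name', 'Age'])

# From a dictionary: keys become column names, values become the columns.
df_from_dict = pd.DataFrame({'Name': ['Ann', 'Ben'], 'Age': [31, 27]})

# From a NumPy array, again naming the columns ourselves.
df_from_array = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['a', 'b'])

print(df_from_list)
print(df_from_dict)
print(df_from_array)
```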

In [60]:
# read in our data from a csv file

import pandas as pd

emp_info = pd.read_csv('Employee_info.csv')

In [61]:
# Use the .head() function to look at the first 5 rows.
emp_info.head()

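| | ID | Transportation expense | Distance from home to work | Age | Education | Children | Social drinker | Social smoker | Pet | Weight | Height |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 235 | 11 | 37 | 3 | 1 | 0 | 0 | 1 | 88 | 172 |
| 1 | 2 | 235 | 29 | 48 | 1 | 1 | 0 | 1 | 5 | 88 | 163 |
| 2 | 3 | 179 | 51 | 38 | 1 | 0 | 1 | 0 | 0 | 89 | 170 |
| 3 | 4 | 118 | 14 | 40 | 1 | 1 | 1 | 0 | 8 | 98 | 170 |
| 4 | 5 | 235 | 20 | 43 | 1 | 1 | 1 | 0 | 0 | 106 | 167 |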


In [62]:
# Use the .tail() function to look at the last 5 rows.
emp_info.tail()

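| | ID | Transportation expense | Distance from home to work | Age | Education | Children | Social drinker | Social smoker | Pet | Weight | Height |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 32 | 33 | 289 | 48 | 49 | 1 | 0 | 0 | 0 | 2 | 108 | 172 |
| 33 | 34 | 248 | 25 | 47 | 1 | 2 | 0 | 0 | 1 | 86 | 165 |
| 34 | 35 | 118 | 10 | 37 | 1 | 0 | 0 | 0 | 0 | 83 | 172 |
| 35 | 36 | 179 | 45 | 53 | 1 | 1 | 0 | 0 | 1 | 77 | 175 |
| 36 | 37 | 118 | 13 | 50 | 1 | 1 | 1 | 0 | 0 | 98 | 178 |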


In [63]:
# Use the .describe() function to get an overview of the DataFrame.
emp_info.describe()

| | ID | Transportation expense | Distance from home to work | Age | Education | Children | Social drinker | Social smoker | Pet | Weight | Height |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 37.0 | 37.0 | 37.0 | 37.0 | 37.0 | 37.0 | 37.0 | 37.0 | 37.0 | 37.0 | 37.0 |
| mean | 19.0 | 236.945946 | 27.162162 | 38.054054 | 1.351351 | 1.135135 | 0.513514 | 0.189189 | 1.297297 | 78.675676 | 172.945946 |
| std | 10.824355 | 72.975242 | 14.318817 | 7.989389 | 0.753371 | 1.004494 | 0.506712 | 0.397061 | 2.066463 | 13.50114 | 6.275818 |
| min | 1.0 | 118.0 | 5.0 | 27.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 56.0 | 163.0 |
| 25% | 10.0 | 179.0 | 15.0 | 32.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 68.0 | 169.0 |
| 50% | 19.0 | 235.0 | 26.0 | 37.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 76.0 | 172.0 |
| 75% | 28.0 | 289.0 | 36.0 | 43.0 | 1.0 | 2.0 | 1.0 | 0.0 | 2.0 | 88.0 | 175.0 |
| max | 37.0 | 388.0 | 52.0 | 58.0 | 4.0 | 4.0 | 1.0 | 1.0 | 8.0 | 108.0 | 196.0 |


In [64]:
# Use the .info() function to print a concise summary of the DataFrame.
emp_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   ID                          37 non-null     int64
 1   Transportation expense      37 non-null     int64
 2   Distance from home to work  37 non-null     int64
 3   Age                         37 non-null     int64
 4   Education                   37 non-null     int64
 5   Children                    37 non-null     int64
 6   Social drinker              37 non-null     int64
 7   Social smoker               37 non-null     int64
 8   Pet                         37 non-null     int64
 9   Weight                      37 non-null     int64
 10  Height                      37 non-null     int64
dtypes: int64(11)
memory usage: 3.3 KB


Let's do a quick recap of the functions we've covered above... 

| Syntax | Description |
| ----------- | ----------- |
| `.head()` | Shows only the first 5 records of our data. This is helpful if the dataframe has many rows and displaying all of them would take a long time. We can also specify the number of rows we want to look at by adding the number in the brackets of the function. |
| `.tail()` | Shows only the last 5 records of our data. |
| `.describe()` | Generates descriptive statistics of the data in a Pandas DataFrame, giving a broader overview of the dataset. |
| `.info()` | Prints a concise summary of a DataFrame, including the index dtype and column dtypes, non-null values and memory usage. |
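As a small sketch of passing a row count to `.head()` and `.tail()`, the dataframe `numbers` below is invented for illustration:

```python
import pandas as pd

numbers = pd.DataFrame({'n': range(1, 11), 'square': [i ** 2 for i in range(1, 11)]})

# Passing a number overrides the default of 5 rows.
first_three = numbers.head(3)
last_two = numbers.tail(2)

print(first_three['n'].tolist())   # [1, 2, 3]
print(last_two['n'].tolist())      # [9, 10]

# The shape attribute gives (rows, columns) without printing the data.
print(numbers.shape)               # (10, 2)
```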

### Accessing Dataframes

Accessing data within dataframes can be done by index, by column, or by both. Let's work through these methods.

#### By Index
To access rows by index only, we can use the `iloc` or `loc` indexers with the indices in square brackets. `iloc` refers to the integer position of the index, so we pass in the position number, while `loc` refers to the label of the index, so we pass in the index name. Use slicing if you want more than one row. For example:

* `dataframe.iloc[index i]` - returns series at index i
* `dataframe.iloc[index start: index end]` - returns dataframe from start to end (end not included)
* `dataframe.loc['index name']` - returns series of given index name

Let's look at a few examples:

In [65]:
# Creating dataframe
data = [['John', 42, 1.7, 50], ['Joseph', 15, 1.5, 42], ['James',25, 1.5, 36], ['Jeff', 18, 1.4, 60], 
        ['Yuri', 30, 1.4, 55], ['Chad', 33, 1.7, 56]]

columns = ['Name', 'Age', 'Height', 'Weight in Kg']

index = ['South Africa', 'Zambia', 'Kenya', 'Swaziland', 'Ghana', 'Nigeria']

df_data = pd.DataFrame(data=data, columns=columns, index=index)
df_data

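| | Name | Age | Height | Weight in Kg |
| --- | --- | --- | --- | --- |
| South Africa | John | 42 | 1.7 | 50 |
| Zambia | Joseph | 15 | 1.5 | 42 |
| Kenya | James | 25 | 1.5 | 36 |
| Swaziland | Jeff | 18 | 1.4 | 60 |
| Ghana | Yuri | 30 | 1.4 | 55 |
| Nigeria | Chad | 33 | 1.7 | 56 |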


In [66]:
# Select the 4th row using iloc[].
df_data.iloc[3]

Name            Jeff
Age               18
Height           1.4
Weight in Kg      60
Name: Swaziland, dtype: object

In [67]:
# Select rows 2 to 5 using iloc.
df_data.iloc[1:5]

| | Name | Age | Height | Weight in Kg |
| --- | --- | --- | --- | --- |
| Zambia | Joseph | 15 | 1.5 | 42 |
| Kenya | James | 25 | 1.5 | 36 |
| Swaziland | Jeff | 18 | 1.4 | 60 |
| Ghana | Yuri | 30 | 1.4 | 55 |


In [68]:
# Select South Africa using loc[].
df_data.loc['South Africa']

Name            John
Age               42
Height           1.7
Weight in Kg      50
Name: South Africa, dtype: object
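One difference worth knowing when slicing: `iloc` excludes the end position, while `loc` includes the end label. The sketch below uses a cut-down, illustrative dataframe named `people` rather than the full `df_data`.

```python
import pandas as pd

people = pd.DataFrame(
    {'Name': ['John', 'Joseph', 'James'], 'Age': [42, 15, 25]},
    index=['South Africa', 'Zambia', 'Kenya'],
)

# iloc slices by position and excludes the end position.
by_position = people.iloc[0:2]
print(by_position.index.tolist())   # ['South Africa', 'Zambia']

# loc slices by label and includes the end label.
by_label = people.loc['South Africa':'Kenya']
print(by_label.index.tolist())      # ['South Africa', 'Zambia', 'Kenya']
```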

#### By Column
To access by column only we can simply call `dataframe['Column Name']`. If we want more than one column we input a list of column names inside the square brackets:

* `dataframe['Column Name']` - returns series of given column
* `dataframe[['Column 1', 'Column 2']]` - returns dataframe with the given columns

Let's look at examples.

In [None]:
# Select the column 'Age'.
df_data['Age']

In [None]:
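# Attribute-style access returns the same column (works when the column name is a valid Python identifier).
df_data.Age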

In [None]:
# Select the columns 'Age' and 'Height'.
df_data[['Age', 'Height']]

#### By index and column
We can also select a subset of the dataframe using indices and columns in combination. 

In [None]:
# Select the first 4 rows and first 2 columns - Rows first.
df_data.iloc[0:4, 0:2]
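The same kind of selection can be made by label with `loc`. This is a minimal sketch on an illustrative dataframe named `people`, a shortened version of the data above.

```python
import pandas as pd

people = pd.DataFrame(
    {'Name': ['John', 'Joseph', 'James'], 'Age': [42, 15, 25], 'Height': [1.7, 1.5, 1.5]},
    index=['South Africa', 'Zambia', 'Kenya'],
)

# Rows and columns by label - rows first, then columns.
subset = people.loc[['Zambia', 'Kenya'], ['Name', 'Age']]
print(subset.shape)                   # (2, 2)

# A single row label and a single column label return one value.
kenya_age = people.loc['Kenya', 'Age']
print(kenya_age)                      # 25
```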

### Filtering Columns
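You can keep only the part of the dataframe that is of interest and leave out the parts that won't be of value in a particular case. Note that a single slice passed to `iloc` selects rows by position; to select columns by position, put the column slice after a comma, as in `dataframe.iloc[:, start:end]`.

In [None]:
# Keep the rows at positions 2 to 23 (a single slice in iloc selects rows, not columns)
emp_info_filtered = emp_info.iloc[2:24]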
emp_info_filtered
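To keep or drop columns specifically, a few common options are sketched below. The dataframe `staff` is a small stand-in built for this example, not the full `emp_info` file.

```python
import pandas as pd

staff = pd.DataFrame({
    'ID': [1, 2, 3],
    'Age': [37, 48, 38],
    'Weight': [88, 88, 89],
    'Height': [172, 163, 170],
})

# Keep columns by name.
by_name = staff[['ID', 'Age']]

# Keep columns by position: all rows, then the columns at positions 1 and 2.
by_position = staff.iloc[:, 1:3]

# Drop a column that is not needed; the original dataframe is unchanged.
without_id = staff.drop(columns=['ID'])

print(by_name.columns.tolist())       # ['ID', 'Age']
print(by_position.columns.tolist())   # ['Age', 'Weight']
print(without_id.columns.tolist())    # ['Age', 'Weight', 'Height']
```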

### Sorting
You can use the `sort_values()` method to sort the dataframe by one of its columns, in either ascending or descending order.

In [None]:
# Sort by age in ascending order
emp_info_filtered.sort_values(by = 'Age').head()

In [None]:
# Sort by age in descending order
emp_info_filtered.sort_values(by = 'Age', ascending = False).head()
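`sort_values()` can also sort by more than one column, with a separate direction for each. The values in `staff` below are made up for illustration.

```python
import pandas as pd

staff = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Age': [37, 48, 37, 40],
    'Weight': [88, 88, 89, 98],
})

# Sort by Age (ascending), breaking ties by Weight (descending).
ordered = staff.sort_values(by=['Age', 'Weight'], ascending=[True, False])
print(ordered['ID'].tolist())   # [3, 1, 4, 2]

# sort_values returns a new dataframe; the original order is preserved.
print(staff['ID'].tolist())     # [1, 2, 3, 4]
```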

### Filtering
A dataframe also allows us to filter the rows so that we only view the data we are interested in. We can filter on one condition or on several.

In [None]:
# Filter for people who spend more than 250 on transport
emp_info_filtered[emp_info_filtered['Transportation expense'] > 250]
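Filtering on multiple conditions works the same way as with NumPy arrays: wrap each condition in parentheses and join them with `&` or `|`. A minimal sketch, using an invented dataframe `staff` with the same column names as the employee data:

```python
import pandas as pd

staff = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Transportation expense': [235, 289, 179, 118],
    'Age': [37, 48, 38, 40],
})

# Both conditions must hold: & means "and".
older_and_costly = staff[(staff['Transportation expense'] > 200) & (staff['Age'] > 40)]
print(older_and_costly['ID'].tolist())    # [2]

# Either condition may hold: | means "or".
cheap_or_young = staff[(staff['Transportation expense'] < 150) | (staff['Age'] < 38)]
print(cheap_or_young['ID'].tolist())      # [1, 4]
```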

### Data Cleaning: Handling Missing values or Duplicates

Handling missing values and duplicates is a crucial step in data cleaning for any data analysis project. We need to identify missing values and duplicates, then handle them appropriately or remove them from our dataset. This ensures that our data is accurate and ready for analysis, and can help improve the reliability of our results. We will be exploring the following functions:

- `df.isna()`
- `df.duplicated()`
- `df.dropna()`
- `df.drop_duplicates()`

Using the same data from before...

In [None]:
# Check if the data has any missing values
emp_info.isna().sum()

In [None]:
# Check if the data has duplicates
emp_info.duplicated()

In [None]:
# Remove rows containing missing values (returns a new dataframe; emp_info itself is unchanged)
emp_info.dropna()

In [None]:
# Remove duplicate rows (returns a new dataframe; emp_info itself is unchanged)
emp_info.drop_duplicates()
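To see these four functions do something visible, here is a sketch on a tiny dataframe that deliberately contains one missing value and one duplicate row. The dataframe `raw` and its values are invented for this example.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    'ID': [1, 2, 2, 3],
    'Age': [37, 48, 48, np.nan],
})

# Count missing values per column, and flag rows that repeat an earlier row.
print(raw.isna().sum().tolist())      # [0, 1]
print(raw.duplicated().tolist())      # [False, False, True, False]

# Both methods return a new dataframe, so assign the result to keep it.
cleaned = raw.dropna().drop_duplicates()
print(cleaned['ID'].tolist())         # [1, 2]

# The original dataframe still has all four rows.
print(len(raw))                       # 4
```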