<img src="lbnl_logo.jpg">

----

# Introduction to Numpy and Pandas



---

### Table of Contents


1 - [Using Libraries](#section1)<br>

2 - [Numpy Arrays](#section2)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.1 - [Creating Numpy Arrays](#subsection1)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.2 - [Basic Operations with Numpy Arrays](#subsection2)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.3 - [Indexing & Slicing Numpy Arrays](#subsection3)<br>



3 - [Pandas and Dataframes](#section3)<br>


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.1 - [Importing Data & Summary Statistics](#subsection1)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.2 - [Indexing &  Slicing in Pandas](#subsection2)<br>


---
## 1. Using Libraries <a id='section1'></a>


In Python you can import libraries with useful features and use them just like you would an app on your iPhone.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns 
import matplotlib.pyplot as plt
#import data 

In order to save us some typing time we can give our libraries a shorter alias, like <b>np</b> for Numpy and <b>pd</b> for Pandas. 

---
## 2. Numpy Arrays <a id='section2'></a>

### 2.1 Creating Numpy Arrays  <a id='subsection2'></a>


Numpy is a Python library which allows us to easily process large amounts of numerical data.<br>
<br>
A Numpy array is just a table of data of the same type. In order to use Numpy you can either convert data you already have into a Numpy array or create a blank array from scratch. Numpy arrays and Python lists are similar, yet they react differently to various operations (as you might remember from Week 1, Day 2).

From [NumPy.org](https://numpy.org/doc/stable/user/absolute_beginners.html): NumPy can be used to perform a wide variety of mathematical operations on arrays. It adds powerful data structures to Python that guarantee efficient calculations with arrays and matrices and it supplies an enormous library of high-level mathematical functions that operate on these arrays and matrices.
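To make "data of the same type" concrete, here is a minimal sketch. The names `mixed_list` and `mixed_array` are made up for illustration. A list keeps each element's own type, while an array converts every element to one common type, which you can inspect through the `dtype` attribute.

```python
import numpy as np

# A list can hold integers and floats side by side.
mixed_list = [1, 2.5, 3]

# Converting to an array forces a single type for all elements:
# the integers are stored as floats so that 2.5 still fits.
mixed_array = np.array(mixed_list)

print(mixed_array)
print(mixed_array.dtype)   # float64
```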

In [2]:
# EXAMPLE

# Create a new list

list_of_numbers = [0, 1, 2, 3, 4, 5]
list_of_numbers

[0, 1, 2, 3, 4, 5]

In [8]:
# EXERCISE

# Verify that it's indeed a regular Python list

...

Ellipsis

In [4]:
# EXAMPLE

#Create a new Numpy array from our list and display it

array_from_list = np.array(list_of_numbers)
array_from_list

array([0, 1, 2, 3, 4, 5])

In [7]:
# EXERCISE

# Verify that it's an array and not a list

...

Ellipsis

Usually, we get arrays from our data. If we don't yet have any data or just want a placeholder for our data, we can create a Numpy array filled with ones by calling np.ones() and specifying the shape of the array as a tuple, such as (3, 4) for 3 rows and 4 columns.

In [6]:
# EXAMPLE

#Create an array of size 3x4 and fill it with ones

ones_array = np.ones((3, 4))
ones_array

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

We can also create arrays filled with zeros.

In [None]:
# EXERCISE

# Create an array of zeroes
# It will have a similar syntax, but instead we need to call 
# a function "zeros", not "ones" from our Numpy library

zeros_array =  ...  
zeros_array

Similarly we can create Numpy arrays filled with any other value. To accomplish this we use np.full and specify both the size of the array and the value we would like to fill that array with.

In [None]:
# EXAMPLE

array_of_twos = np.full((4, 3), 2)
array_of_twos

In [9]:
# EXERCISE

# Create an array of halves: "0.5" or "1/2"

array_of_halves = ...
array_of_halves

Ellipsis

As we learnt in Week 1, Day 3, we usually don't need to type out values like pi ourselves; we can just import them from a library. In the cell below, create a 4x3 array in which every element is pi. 

**Hint:** You can either use math or numpy libraries to import the value of pi.

In [None]:
# EXERCISE

array_of_pies = np.full((4,3), np.pi) #SOLUTION
array_of_pies

If we want Numpy to fill the array with random numbers between zero and one, we can use np.random.rand().
Confusingly, for this function we do not need to put the size of the array in a tuple. Instead we just give Numpy the dimensions of the array directly.  

In [None]:
# EXAMPLE

np.random.rand(2,3)
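The difference in calling conventions can be seen side by side in the sketch below. It also shows NumPy's newer random `Generator` interface, which is not used elsewhere in this notebook and is included only for comparison; it takes the shape as a tuple, just like `np.ones`. The seed value `0` is an arbitrary choice.

```python
import numpy as np

# np.ones expects the shape as a single tuple argument...
ones = np.ones((2, 3))

# ...while np.random.rand expects each dimension as a separate argument.
legacy_random = np.random.rand(2, 3)

# The Generator API takes the shape as a tuple again.
rng = np.random.default_rng(seed=0)   # seed chosen arbitrarily for this sketch
generator_random = rng.random((2, 3))

print(ones.shape, legacy_random.shape, generator_random.shape)
```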

### 2.2 Basic Operations with Numpy Arrays  <a id='subsection2'></a>


Of course, Numpy is much more than just an array creator. It allows us to do very fast operations on whole arrays at once. An operation performed with Numpy on an array is typically computed significantly faster than the equivalent Python loop over a list.
For example, let's say that we have an array of one million random probabilities of it raining on a particular day.

In [None]:
# EXAMPLE

random_million = np.random.rand(1000000, 1)
random_million

Let's check that the array is actually one million numbers long. 

**Hint:** Just like with most other data structures (lists, tuples, etc.) and some data types (strings), you can use the traditional Python len( ) function to check the *length* of your object.

In [10]:
# EXERCISE

...

Ellipsis

Yep, that checks out.<br> <br>Now let's say that we want these probabilities to be percentages (out of 100) rather than proportions (from zero to one). We can just multiply the whole array by 100!

In [None]:
# EXAMPLE

percentages = random_million * 100
percentages

Notice that Numpy accomplishes that multiplication in a fraction of a second when we use it with arrays. That's a million multiplications! In the cell below you can see how much longer the code is for doing the same operation on a list. 
**Note:** we cut our list down to 10,000 values, because running a for-loop over 1 million values can take a long time. 

In [None]:
# EXAMPLE

random_mln_lst = list(random_million[:10000])
percentages_lst = [] 

for i in random_mln_lst:
    percentages_lst.append(i * 100)
    
print(percentages_lst)    
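If you would like to measure the difference rather than just feel it, here is a minimal timing sketch. The sample size of 100,000 and all variable names are assumptions chosen so the cell runs quickly; the exact timings depend on your machine, so no expected numbers are shown.

```python
import time
import numpy as np

values = np.random.rand(100_000)   # hypothetical sample size
values_list = values.tolist()      # the same numbers as a plain Python list

# Time the for-loop over a list.
start = time.perf_counter()
loop_result = []
for value in values_list:
    loop_result.append(value * 100)
loop_seconds = time.perf_counter() - start

# Time the same multiplication on the whole array at once.
start = time.perf_counter()
vectorized_result = values * 100
vectorized_seconds = time.perf_counter() - start

print(f"for-loop:   {loop_seconds:.5f} s")
print(f"vectorized: {vectorized_seconds:.5f} s")
```

Both approaches produce the same numbers; only the time taken differs.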

Let's see if we can get Numpy to at least break a sweat doing multiplications.

In [None]:
# EXAMPLE

# 100 million multiplications
np.random.rand(100000000) * 100

Yeah, Numpy is really fast! Not to mention that it first needs to come up with the random numbers in the array, and only then can it do the multiplications we are asking it to do. That's pretty useful if you want to analyze a huge amount of data!

Let's see what other operations we can do with Numpy arrays.

Can it add the same number to all the elements of our array? How about subtracting it? Even dividing by it?

In [None]:
# EXAMPLE

plus_fifty = random_million + 50
plus_fifty

How about if we want to divide each value by 2?

In [11]:
# EXERCISE

divided_by_two = ...
divided_by_two

Ellipsis

We can even do those arithmetic operations between two arrays if they are of the same size!

In [None]:
# EXAMPLE

sum_of_arrays = divided_by_two + plus_fifty
sum_of_arrays

Think about what would happen if we tried to add two lists.
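A small sketch of the contrast, using made-up names: with lists, `+` joins the two lists end to end, while with arrays of the same shape `+` adds the values element by element.

```python
import numpy as np

list_a = [1, 2, 3]
list_b = [10, 20, 30]

# Lists: + concatenates.
print(list_a + list_b)      # [1, 2, 3, 10, 20, 30]

array_a = np.array(list_a)
array_b = np.array(list_b)

# Arrays: + adds matching elements.
print(array_a + array_b)    # [11 22 33]
```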

### 2.3 Indexing & Slicing Numpy Arrays  <a id='subsection2'></a>


Just like we did with Python lists, if we ever need to retrieve a value at a particular index in a Numpy array, we can use [num] to get it.

In [None]:
# EXAMPLE

print(array_from_list)
print("Value at index 0 is:", array_from_list[0])

Try to retrieve a value at index **3** of our array.

In [None]:
# EXERCISE

print("Value at index 3 is:", ...) 

We can also get a "slice" of numbers using [start:stop], just like we would from a list.

In [None]:
# EXAMPLE

array_from_list[2:5]

Now how would you return all the values starting with index 2 (skipping values at indices 0 and 1)?

In [None]:
# EXERCISE

...

You can think of arrays as tables. If your array has more than one column per row, we just use a comma between the index of the first dimension (row) and the index of the second dimension (column). The indexing and slicing works exactly the same as before, but we can do it separately for rows and columns.

In [None]:
# EXAMPLE

# Note: despite the variable name, this array has 5 rows and 3 columns
three_by_five = np.random.rand(5, 3)
three_by_five

In [None]:
# EXAMPLE

three_by_five[0, 0]

Now let's try to retrieve a value from the bottom right.

**Hint:** Remember, first we input rows, then columns. Also, don't forget that Python starts counting from *zero*.

In [None]:
# EXERCISE

...

Just like with one-dimensional arrays and lists, we can slice arrays that have multiple columns. The same rules apply: first we input the desired rows, then the desired columns. In the cell below, we ask for the values in columns 1 through 3 of the very first row. Our array has only 3 columns (indices 0 through 2), so only two values come back, but a slice that runs past the end does not raise an error. 

In [None]:
# EXAMPLE

three_by_five[0, 1:4]

In [None]:
# EXAMPLE

three_by_five[1:, :1]
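The behavior of slices that run past the edge of an array is easier to see with predictable values. The sketch below uses a hypothetical array called `grid`, built with `np.arange` so that every value is known in advance. A slice is quietly cut off at the edge, while a single out-of-range index raises an `IndexError`.

```python
import numpy as np

# 5 rows, 3 columns, filled with the values 0 through 14 row by row.
grid = np.arange(15).reshape(5, 3)

# The slice 1:4 asks for columns 1, 2 and 3, but column 3 does not exist.
# Numpy simply returns the columns that do exist.
print(grid[0, 1:4])   # [1 2]

# Asking for column 3 directly is an error.
try:
    grid[0, 3]
except IndexError as error:
    print("IndexError:", error)
```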

Now how would we output only the last values of all rows?

In [None]:
# EXERCISE

...

---
## 3. Pandas and Data Frames <a id='section3'></a>

### 3.1 Importing Data & Summary Statistics  <a id='subsection3'></a>

We will use the function `read_csv()` in the **Pandas** library to import and read our data. The _csv_ in the function name stands for comma-separated values, so by default it expects a comma-delimited file. Files can also use other delimiters such as tab, semicolon, or pipe; for those, pass the delimiter through the `sep` argument. 

We will now read the _iris.csv_ file as a **DataFrame** and store it in a variable called _iris_.

In [None]:
# EXAMPLE
# save iris.csv from your folder 
# as a dataframe under a variable iris

iris = pd.read_csv('iris.csv')
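For files that are not comma-delimited, `read_csv()` accepts a `sep` argument. The sketch below is only an illustration: the data is made up and is read from in-memory text with `io.StringIO` instead of from a file, so that the cell runs without any extra files.

```python
import io
import pandas as pd

# Two made-up versions of the same tiny table with different delimiters.
semicolon_text = "name;height\nfern;12.5\nmoss;1.2\n"
tab_text = "name\theight\nfern\t12.5\nmoss\t1.2\n"

# Tell read_csv which delimiter each source uses.
semicolon_df = pd.read_csv(io.StringIO(semicolon_text), sep=";")
tab_df = pd.read_csv(io.StringIO(tab_text), sep="\t")

print(semicolon_df)
print(tab_df)
```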

Great! Now let's explore our data set. 

We will begin by using the method (or function) `.head()`. By default, it will show the first 5 rows of our data set, but you can tell it to display the first _n_ rows by passing _n_ as an argument to `.head()`.

In [None]:
# EXAMPLE

iris.head()

You can also see the last _n_ rows of our data using the method `.tail()`.

In [None]:
# EXAMPLE

iris.tail()

`DataFrames` contain rows and columns. You can think of them as Excel sheets. If you want to understand the structure of your DataFrame, there are a few functions and attributes that might come in handy. 

These include
* `shape`: outputs n rows and n columns
* `columns`: outputs names of columns
* `index`: outputs the row labels; for the default index this is shown as a range (start, stop, step)
* `info()`: outputs information about each column: its position, its data type (numbers are sometimes stored as strings, which prevents calculations from running properly), and the number of non-null values, from which you can work out how many values are missing.
* `describe()`: outputs basic statistics for each numeric column, such as the count, mean, standard deviation, minimum, quartiles (including the median), and maximum.
* `len()`: just like with other data structures, we can use len( ) with DataFrames. 

In [None]:
# EXAMPLE

iris.shape

The iris DataFrame contains 150 rows and 5 columns.

In [None]:
# EXAMPLE

iris.columns

In [None]:
# EXAMPLE

iris.index

In [None]:
# EXAMPLE

iris.info()

As with lists and arrays, you can also use the function `len()` to see how many elements (in this case rows) our data set contains.

In [None]:
# EXAMPLE

len(iris)

Another useful method is `.describe()`. It provides basic statistics about each of the variables in your DataFrame, including measures of central tendency, dispersion, and the shape of a dataset's distribution, excluding **NaN** values.
* By default, it returns summary statistics for the numeric columns only, but it can also work with mixed data. When called on string columns, it returns the count, the number of unique values, the most frequent value, and that value's frequency.

In [None]:
# EXAMPLE

iris.describe()
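To see how `.describe()` treats numeric and text columns differently, here is a small sketch on a made-up DataFrame called `flowers` (not the real iris data). The `include="all"` argument asks for a summary of every column at once.

```python
import pandas as pd

# A tiny, made-up table with one numeric and one text column.
flowers = pd.DataFrame({
    "petal_length": [1.4, 4.7, 6.0, 5.1],
    "variety": ["Setosa", "Versicolor", "Virginica", "Virginica"],
})

# Default: only the numeric column is summarized.
print(flowers.describe())

# On a text column: count, unique, top (most frequent value) and freq.
print(flowers["variety"].describe())

# Every column at once; statistics that do not apply show up as NaN.
print(flowers.describe(include="all"))
```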

### 3.2 Indexing & Slicing in Pandas  <a id='subsection2'></a>

There are two main ways of indexing DataFrames. We will still use our old friend, the square brackets [ : ], but now we will need the help of two indexers: **loc** and **iloc**.

* **loc**: uses the labels (names) of rows and columns.
* **iloc**: uses the integer positions of rows and columns. You can think of *iloc* as *integer-loc*.


#### .loc[rows-label(s), columns-label(s)]
`.loc` helps us view and index our DataFrame. 
* It works with labels. Most of the time columns have descriptive names, but rows are often labeled with numbers, so the row label will be a number.
* It can take 
    * one label __(df.loc[row-label, 'col-label-1'])__
    * a list of labels __(df.loc[[row-label-1, row-label-2, row-label-4], ['col-label-1', 'col-label-2', 'col-label-4']])__
    * or a _slice_ of labels __(df.loc[row-label-50 : row-label-100, 'col-label-1' : 'col-label-8'])__


#### Rows

Let's use `loc` to see the values in row 10 of our DataFrame.

In [None]:
# EXAMPLE

iris.loc[10]

* Notice that if our rows were labeled with text, we would have to use that label instead of `10`. In this case the rows are labeled with their positions, so the label 10 refers to the row at position 10 (the eleventh row, since counting starts at zero). 
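Here is a minimal sketch of what text labels look like in practice. The DataFrame `plants` and its values are made up for illustration. With text labels, `loc` needs the label itself, while `iloc` still works with integer positions.

```python
import pandas as pd

# A made-up table whose rows are labeled with text instead of numbers.
plants = pd.DataFrame(
    {"height_cm": [12.5, 1.2, 30.0]},
    index=["fern", "moss", "reed"],
)

print(plants.loc["moss"])   # select the row by its label
print(plants.iloc[1])       # select the same row by its integer position

# plants.loc[1] would raise a KeyError here, because no row is labeled 1.
```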

What if we want to see the values in rows 5, 10, and 15? Let's pass `[5, 10, 15]` to `loc` as a list of labels. 


In [None]:
# EXAMPLE

iris.loc[[5,10,15]]

This returned a `DataFrame`, whereas the first example returned a `Series`. This is because this time we passed a list of labels rather than a single label. 

How would you use `loc` to see the values of rows 10-20? Yes, you can use a list like in the example above, but it is quite cumbersome to type each number from 10 to 20. There is a better way: slicing, just like we did with arrays and lists. Unlike list slicing, a `loc` slice includes its end label. 

In [None]:
# EXERCISE

iris.loc[10:20] #SOLUTION
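One detail worth checking for yourself: a `loc` slice includes the end label, while an `iloc` slice (like a list slice) stops just before the end position. The sketch below uses a made-up DataFrame called `numbers` with the default row labels 0 through 29.

```python
import pandas as pd

# 30 rows with default labels 0..29; the values themselves are arbitrary.
numbers = pd.DataFrame({"value": range(100, 130)})

by_label = numbers.loc[10:20]      # labels 10 through 20, end included
by_position = numbers.iloc[10:20]  # positions 10 through 19, end excluded

print(len(by_label))      # 11
print(len(by_position))   # 10
```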

#### Columns 

Great! Now that you know how to index rows, let's see how we can index columns. Don't forget that we are still using `loc`, so we will have to use column labels.

Let's begin by indexing by one column, variety. Let's output all rows for this column.

In [None]:
# EXAMPLE

iris.loc[:,'variety']

Another way to index by only one column is by adding the column label in a list. 

In [None]:
# EXAMPLE

iris.loc[:,['variety']]

Do you notice the difference in output? The first output is a `Series` (another type of data structure), and the second is a one-column `DataFrame` because we passed a list.  

Notice that here we had to specify the range of rows that we want to index that column by. We used `:` in order to return all values in the column.
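If the difference is hard to spot in the printed output, you can ask Python for the type and shape directly. This sketch uses a made-up DataFrame called `flowers` rather than the iris data.

```python
import pandas as pd

flowers = pd.DataFrame({
    "petal_length": [1.4, 4.7, 6.0],
    "variety": ["Setosa", "Versicolor", "Virginica"],
})

as_series = flowers.loc[:, "variety"]    # a single label
as_frame = flowers.loc[:, ["variety"]]   # a list containing one label

print(type(as_series).__name__, as_series.shape)   # Series (3,)
print(type(as_frame).__name__, as_frame.shape)     # DataFrame (3, 1)
```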

Now, let's index by more than one column. Just as before we will use a list containing our desired column labels. 

In [None]:
# EXAMPLE

iris.loc[:,['sepal.length', 'sepal.width','variety']]

Just as we sliced rows, we can do the same with columns. In the cell below, return all rows for columns *sepal.length* through *petal.width* (inclusive of the last column).

In [None]:
# EXERCISE

...

#### .iloc[rows_index, columns_index]

Another way to index is using `.iloc`. As was mentioned above, `iloc` allows us to index using integer positions instead of the labels of our rows and columns.

#### Rows

In [None]:
# EXAMPLE

iris.iloc[[1,3,6,8,9]]

Recall __start:stop:step__ from lists? We can also select a range of rows with a specified step value in our DataFrame. Here we take every 5th row, starting at position 50 and stopping before position 150. 

In [None]:
# EXAMPLE

iris.iloc[50:150:5]

#### Columns

As we mentioned before, `iloc` works just like `loc`, but instead of labels we use integer positions. Let's get all the rows in the fifth column. Don't forget that counting starts at index 0, so the fifth column is at position 4.

In [None]:
# EXAMPLE

iris.iloc[:,4]

How would you get `iloc` to return a one-column `DataFrame` (aka table) instead of a `Series`?

In [None]:
# EXERCISE

...

---
Notebook developed by: Kseniya Usovich