<div class="alert alert-block alert-info">
Singapore Management University<br>
CS105 Statistical Thinking for Data Science
</div>

# Lab 1: Python for Data Science

>#### Table of Contents
>
>- Introduction to NumPy
>    - Arrays
>    - Aggregation functions
>    - Copy vs assignment
>- Introduction to pandas
>    - Series
>    - DataFrame
>    - Selection, slice and dice
>    - Aggregation functions
>    - Grouping data
>    - Defining new columns
>

## 1 Introduction to NumPy

NumPy is a Python library for numerical computation.  It provides built-in classes and implementations of common mathematical routines, which let you express a problem more succinctly than in plain Python code.  Examples include $n$-dimensional arrays, random number generators, and efficient matrix multiplication, to name a few.

By convention, NumPy is imported under the alias `np`:

In [1]:
import numpy as np

There's also an official reference: https://numpy.org/

### 1.1 Arrays

The fundamental object is the NumPy array, a collection of data arranged in one or more dimensions.  This section focuses on one-dimensional arrays.

You can create a one-dimensional array by passing a list to `np.array`.

In [2]:
a = np.array([3, 1, 4, 1, 5, 9, 2, 6])
a

array([3, 1, 4, 1, 5, 9, 2, 6])

To get the number of elements in an array, use the `size` attribute. It is an attribute rather than a method, so it takes no parentheses.

In [3]:
a.size

8
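As a small sketch, the array from the cell above can be inspected through a few related attributes. `ndim` and `shape` are standard NumPy attributes that this lab does not otherwise cover, so treat them as an optional aside.

```python
import numpy as np

a = np.array([3, 1, 4, 1, 5, 9, 2, 6])

print(a.size)    # total number of elements: 8
print(len(a))    # length along the first dimension: 8
print(a.ndim)    # number of dimensions: 1
print(a.shape)   # size along each dimension: (8,)
```

For a one-dimensional array, `size` and `len` agree; they differ once an array has more than one dimension.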

You can generate a sequence of equally spaced numbers using `arange` or `linspace`.

For `arange`, you specify the spacing between the numbers. The stop value is excluded.

For `linspace`, you instead specify how many elements you want, and NumPy generates them with equal spacing. The stop value is included.

In [4]:
a = np.arange(0, 5, 0.5)
a

array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])

In [4]:
a = np.linspace(0, 5, 6)
print(a)
print(f"Size of array: {a.size}")

[0. 1. 2. 3. 4. 5.]
Size of array: 6
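The difference in how the two functions treat the stop value is easy to miss, so here is a minimal side-by-side sketch. The variable names `by_step` and `by_count` are illustrative only.

```python
import numpy as np

# arange: fixed step of 0.25; the stop value 1 is excluded
by_step = np.arange(0, 1, 0.25)

# linspace: exactly 5 points; the stop value 1 is included by default
by_count = np.linspace(0, 1, 5)

print(by_step.tolist())   # [0.0, 0.25, 0.5, 0.75]
print(by_count.tolist())  # [0.0, 0.25, 0.5, 0.75, 1.0]
```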


Sometimes you want to initialise all elements to zero, or to random values drawn uniformly from the interval $[0, 1)$.

In [5]:
a = np.zeros(10)
a

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [7]:
a = np.random.random(10)
a

array([0.71644044, 0.21340482, 0.68709053, 0.66558949, 0.61896857,
       0.87345624, 0.75692569, 0.35627041, 0.7573401 , 0.37157505])
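Random values change on every run, which can make results hard to reproduce. One possible approach, sketched below, is to create a seeded generator with `np.random.default_rng`; the seed value 42 and the names `rng1`, `rng2` are arbitrary choices for illustration.

```python
import numpy as np

# Two generators created with the same seed produce the same sequence
rng1 = np.random.default_rng(seed=42)
rng2 = np.random.default_rng(seed=42)

x = rng1.random(5)   # 5 floats drawn uniformly from [0, 1)
y = rng2.random(5)

print(np.array_equal(x, y))  # True
```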

You can also filter an array with a boolean condition; only the elements for which the condition is true are kept.

In [8]:
a = np.arange(12)
a[a % 2 == 0]

array([ 0,  2,  4,  6,  8, 10])

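Use ``&`` for the **and** operation and ``|`` for the **or** operation. Wrap each condition in parentheses, because ``&`` and ``|`` bind more tightly than the comparison operators.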

In [9]:
a[(a > 3) & (a < 10)]

array([4, 5, 6, 7, 8, 9])

In [10]:
a[(a % 2 == 0) | (a % 3 == 0)]

array([ 0,  2,  3,  4,  6,  8,  9, 10])
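To see what the filtering expressions above do, it can help to look at the intermediate boolean array. The sketch below stores it in a variable (named `mask` here purely for illustration) and also shows `~`, which negates a mask.

```python
import numpy as np

a = np.arange(12)

# Each comparison yields a boolean array with one entry per element
mask = (a > 3) & (a < 10)
print(mask.tolist())

# Indexing with the mask keeps the elements where it is True
print(a[mask].tolist())    # [4, 5, 6, 7, 8, 9]

# ~ flips the mask, keeping everything outside the range
print(a[~mask].tolist())   # [0, 1, 2, 3, 10, 11]
```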

### 1.2 Aggregation functions

`max`, `min`, `mean` work just as expected.

In [11]:
a = np.random.randint(low=0, high=100, size=10)
a

array([16, 86, 61, 57, 41, 45, 74, 10, 26, 45])

In [12]:
np.max(a)

86

In [13]:
np.min(a)

10

In [14]:
np.mean(a)

46.1
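The same aggregations are also available as methods on the array itself. The sketch below reuses the ten values printed above (typed in by hand so the result is reproducible) and adds `sum`, `argmax` and `argmin`, which are not covered elsewhere in this lab.

```python
import numpy as np

a = np.array([16, 86, 61, 57, 41, 45, 74, 10, 26, 45])

print(a.max(), a.min())         # 86 10
print(a.sum())                  # 461
print(a.mean())                 # 46.1

# argmax / argmin return the position of the extreme value, not the value
print(a.argmax(), a.argmin())   # 1 7
```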

You can sort in place by calling the `sort` method of the array.

In [15]:
a = np.random.randint(low=0, high=10, size=10)
print("before sorting", a)
a.sort()
print("after sorting", a)

before sorting [0 0 4 9 8 4 0 5 7 0]
after sorting [0 0 0 0 4 4 5 7 8 9]


Or use `np.sort` to create a new sorted copy, leaving the original array unchanged.

In [16]:
a = np.random.randint(low=0, high=10, size=10)
print("a original", a)
b = np.sort(a)
print("a after sorting", a)
print("b after sorting", b)

a original [5 6 0 7 2 5 3 8 6 2]
a after sorting [5 6 0 7 2 5 3 8 6 2]
b after sorting [0 2 2 3 5 5 6 6 7 8]
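Two common follow-up needs are sorting in descending order and finding the order of the elements without moving them. A minimal sketch, assuming a small hand-written array; `np.argsort` is a standard NumPy function not otherwise used in this lab.

```python
import numpy as np

a = np.array([5, 6, 0, 7, 2])

# Sort ascending, then reverse the result to get descending order
descending = np.sort(a)[::-1]

# argsort returns the indices that would sort the array
order = np.argsort(a)

print(descending.tolist())   # [7, 6, 5, 2, 0]
print(order.tolist())        # [2, 4, 0, 1, 3]
print(a[order].tolist())     # [0, 2, 5, 6, 7]
```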


### 1.3 Copy vs assignment

Everything is an object in Python, and an object can be mutable or immutable.  For example, a tuple is immutable whereas a NumPy array is mutable.  Assignment (i.e. `=`) never creates a new copy; it merely binds another name to the same object.  This matters most for mutable objects, because a change made through one name is visible through every other name that refers to the object.

Below, when `b = a` is run, `b` and `a` refer to the same array object. Hence, we can use either `a` or `b` to modify that array. Here we modify it via `b`.

In [17]:
a = np.arange(6)
b = a
b[0] = 10
a

array([10,  1,  2,  3,  4,  5])

As such, we need to be careful when dealing with arrays. If you need to keep the elements of the original array, create a copy first, as shown below.

In [18]:
a = np.arange(6)
b = a.copy()
b[0] = 10
a

array([0, 1, 2, 3, 4, 5])
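One way to check whether two names refer to the same data is sketched below, using the `is` operator and `np.shares_memory`. The sketch also includes a slice, which goes slightly beyond the text above: slicing a NumPy array gives a new array object that still shares the original data. The names `alias`, `view` and `independent` are illustrative.

```python
import numpy as np

a = np.arange(6)
alias = a                 # assignment: same object
view = a[:3]              # slice: new object, but shares a's data
independent = a.copy()    # copy: separate data

print(alias is a)                         # True
print(np.shares_memory(view, a))          # True
print(np.shares_memory(independent, a))   # False

view[0] = 10              # writes through to a
independent[1] = 99       # does not affect a
print(a.tolist())         # [10, 1, 2, 3, 4, 5]
```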

## 2 Introduction to pandas

pandas is a Python library for data manipulation, in contrast to NumPy, which focuses on numerical routines.  Its fundamental building blocks are `Series` and `DataFrame`.  Let's import pandas before diving into the details.

In [6]:
import pandas as pd

Official reference: https://pandas.pydata.org/

### 2.1 Series

Like a `list`, a `Series` is a one-dimensional container.  The key difference is that a `Series` lets you label each element through its index.

In [7]:
temps = pd.Series([29, 32, 31], index=["mon", "tue", "wed"])
temps

mon    29
tue    32
wed    31
dtype: int64

Retrieve a value by its index label.

In [21]:
temps["tue"]

32
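A `Series` can also be built from a dictionary, in which case the keys become the index. The sketch below rebuilds the same temperatures that way and contrasts label-based access (`loc`) with position-based access (`iloc`).

```python
import pandas as pd

# Dictionary keys become the index labels
temps = pd.Series({"mon": 29, "tue": 32, "wed": 31})

print(temps.loc["tue"])       # by label: 32
print(temps.iloc[1])          # by position: 32
print(temps.index.tolist())   # ['mon', 'tue', 'wed']
```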

### 2.2 DataFrame

A `DataFrame` is an Excel-like, column-based data structure.  It is one of the most commonly used data structures in data science because many data sources fit naturally into a column-based format.

We have placed a dataset named `iris.csv` in the same directory as this notebook.  It records the variety of each flower along with attributes such as petal length and width.

In many data science problems, your dataset is already prepared in a format such as `xls` or `csv`.  As a first step, you import the data into a `DataFrame` as below.  Here you load the dataset and call it `df`, which is short for `DataFrame`.

In [8]:
df = pd.read_csv("iris.csv")
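If `iris.csv` is not at hand, the loading step can still be tried out on CSV text held in memory. The sketch below is an assumption-laden stand-in: the column names and three rows are copied from the outputs shown in this lab, the actual layout of the file may differ, and `csv_text` and `sample` are illustrative names.

```python
import io
import pandas as pd

# A few rows in CSV form (assumed layout, copied from this lab's outputs)
csv_text = """sepal_length,sepal_width,petal_length,petal_width,variety
5.7,2.8,4.5,1.3,Versicolor
5.7,2.6,3.5,1.0,Versicolor
6.0,3.4,4.5,1.6,Versicolor
"""

# read_csv accepts a file-like object as well as a file path
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (3, 5)
print(sample.dtypes)   # numeric columns are parsed as floats, variety as text
```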

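There are several ways to get a quick sense of what the data looks like: `head` shows the first few rows, `describe` gives summary statistics for the numeric columns, and `shape` gives the number of rows and columns.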

In [9]:
df.head()

|   | sepal_length | sepal_width | petal_length | petal_width | variety |
|---|---|---|---|---|---|
| 0 | 5.7 | 2.8 | 4.5 | 1.3 | Versicolor |
| 1 | 5.7 | 2.6 | 3.5 | 1.0 | Versicolor |
| 2 | 6.0 | 3.4 | 4.5 | 1.6 | Versicolor |
| 3 | 5.0 | 2.0 | 3.5 | 1.0 | Versicolor |
| 4 | 5.6 | 2.8 | 4.9 | 2.0 | Virginica |


In [10]:
df.describe()

|   | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |


In [11]:
df.shape

(150, 5)

### 2.3 Selection, slice and dice

You can select columns by their labels, i.e. names.

In [26]:
df1 = df[["sepal_length", "petal_length", "variety"]]
df1.head()

|   | sepal_length | petal_length | variety |
|---|---|---|---|
| 0 | 5.7 | 4.5 | Versicolor |
| 1 | 5.7 | 3.5 | Versicolor |
| 2 | 6.0 | 4.5 | Versicolor |
| 3 | 5.0 | 3.5 | Versicolor |
| 4 | 5.6 | 4.9 | Virginica |


To select based on integer positions, use `iloc`. Positions start at 0. Here we select all rows along columns 1 and 3, i.e. the second and fourth columns.

In [27]:
df1 = df.iloc[:, [1,3]]
df1.head()

|   | sepal_width | petal_width |
|---|---|---|
| 0 | 2.8 | 1.3 |
| 1 | 2.6 | 1.0 |
| 2 | 3.4 | 1.6 |
| 3 | 2.0 | 1.0 |
| 4 | 2.8 | 2.0 |


We can also select the rows at positions 10 to 14 (the end position 15 is excluded) along all columns.

In [28]:
df.iloc[10:15]

|   | sepal_length | sepal_width | petal_length | petal_width | variety |
|---|---|---|---|---|---|
| 10 | 5.0 | 3.6 | 1.4 | 0.2 | Setosa |
| 11 | 5.9 | 3.0 | 5.1 | 1.8 | Virginica |
| 12 | 5.5 | 2.5 | 4.0 | 1.3 | Versicolor |
| 13 | 7.7 | 2.6 | 6.9 | 2.3 | Virginica |
| 14 | 5.5 | 4.2 | 1.4 | 0.2 | Setosa |
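`iloc` has a label-based counterpart, `loc`, and the two treat the end of a slice differently. The sketch below uses a small hypothetical frame (`df_small`, with made-up index labels 10 to 13) to show the contrast.

```python
import pandas as pd

# Hypothetical data; the index labels deliberately do not start at 0
df_small = pd.DataFrame(
    {
        "sepal_length": [5.7, 6.0, 5.0, 5.6],
        "variety": ["Versicolor", "Versicolor", "Setosa", "Virginica"],
    },
    index=[10, 11, 12, 13],
)

by_position = df_small.iloc[0:2]   # positions 0 and 1; end position excluded
by_label = df_small.loc[10:12]     # labels 10, 11 and 12; end label included

print(len(by_position))   # 2
print(len(by_label))      # 3
```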


You can also apply multiple conditions to select the rows.

In [29]:
df1 = df[(df.sepal_length > 5) & (df.variety == "Setosa")]
df1.head()

|   | sepal_length | sepal_width | petal_length | petal_width | variety |
|---|---|---|---|---|---|
| 5 | 5.7 | 4.4 | 1.5 | 0.4 | Setosa |
| 14 | 5.5 | 4.2 | 1.4 | 0.2 | Setosa |
| 15 | 5.7 | 3.8 | 1.7 | 0.3 | Setosa |
| 19 | 5.2 | 4.1 | 1.5 | 0.1 | Setosa |
| 21 | 5.4 | 3.7 | 1.5 | 0.2 | Setosa |
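The same kind of row filter can be written in a few other ways. As a sketch on a small hypothetical frame (`df_small`), the block below compares the boolean-mask form with `DataFrame.query`, and shows `isin` for matching against several values; neither `query` nor `isin` is used elsewhere in this lab.

```python
import pandas as pd

# Hypothetical data for illustration
df_small = pd.DataFrame({
    "sepal_length": [5.7, 4.9, 5.5, 6.3],
    "variety": ["Setosa", "Setosa", "Versicolor", "Virginica"],
})

# Boolean mask, as in the cell above
masked = df_small[(df_small.sepal_length > 5) & (df_small.variety == "Setosa")]

# The same filter written as a query string
queried = df_small.query("sepal_length > 5 and variety == 'Setosa'")

# isin tests membership in a list of values
two_kinds = df_small[df_small.variety.isin(["Setosa", "Virginica"])]

print(masked.index.tolist())      # [0]
print(queried.index.tolist())     # [0]
print(two_kinds.index.tolist())   # [0, 1, 3]
```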


### 2.4 Aggregation functions

As expected, you can apply `min`, `max`, `sum` and `mean` to the columns.

In [30]:
df.sepal_length.min()

4.3

In [31]:
df.petal_width.mean()

1.1993333333333331

You can count the non-null values.

In [32]:
df.variety.count()

150

Sometimes you want the number of distinct values, which `nunique` returns.

In [33]:
df.variety.nunique()

3

Get the unique values themselves as follows.

In [34]:
df.variety.unique()

array(['Versicolor', 'Virginica', 'Setosa'], dtype=object)
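A related question is how often each distinct value occurs. One option, sketched here on a short hand-written `Series`, is `value_counts`, which returns the counts sorted from most to least frequent. The data and the names `variety` and `counts` are made up for illustration.

```python
import pandas as pd

# Hypothetical column of labels
variety = pd.Series(
    ["Setosa", "Virginica", "Setosa", "Versicolor", "Setosa", "Virginica"]
)

counts = variety.value_counts()
print(counts)

print(counts["Setosa"])   # 3
print(counts.index[0])    # most frequent value: Setosa
```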

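You can sort the rows by one or more columns. Passing `inplace=True` sorts the `DataFrame` in place. Here we sort by `variety` in descending order, then by `sepal_length` in ascending order.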

In [35]:
df.sort_values(by=["variety", "sepal_length"], ascending=[False, True], inplace=True)
df.head()

|   | sepal_length | sepal_width | petal_length | petal_width | variety |
|---|---|---|---|---|---|
| 139 | 4.9 | 2.5 | 4.5 | 1.7 | Virginica |
| 4 | 5.6 | 2.8 | 4.9 | 2.0 | Virginica |
| 33 | 5.7 | 2.5 | 5.0 | 2.0 | Virginica |
| 43 | 5.8 | 2.8 | 5.1 | 2.4 | Virginica |
| 83 | 5.8 | 2.7 | 5.1 | 1.9 | Virginica |
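Without `inplace=True`, `sort_values` returns a new sorted `DataFrame` and leaves the original untouched. The sketch below shows this on a small hypothetical frame (`df_small`) and additionally uses `reset_index(drop=True)` to renumber the rows, since sorting keeps the original index labels.

```python
import pandas as pd

# Hypothetical data for illustration
df_small = pd.DataFrame({
    "variety": ["Setosa", "Virginica", "Setosa"],
    "sepal_length": [5.1, 6.3, 4.9],
})

sorted_df = (
    df_small
    .sort_values(by=["variety", "sepal_length"], ascending=[False, True])
    .reset_index(drop=True)   # renumber rows 0, 1, 2 after sorting
)

print(sorted_df.sepal_length.tolist())   # [6.3, 4.9, 5.1]
print(df_small.sepal_length.tolist())    # original order kept: [5.1, 6.3, 4.9]
```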


### 2.5 Grouping data

You can aggregate data with `groupby`. Here we group by `variety` to get the average values of `petal_length` and `sepal_length` for each variety.

In [36]:
df.groupby(by=["variety"])[["petal_length", "sepal_length"]].mean()

| variety | petal_length | sepal_length |
|---|---|---|
| Setosa | 1.462 | 5.006 |
| Versicolor | 4.26 | 5.936 |
| Virginica | 5.552 | 6.588 |


More generally, use `agg` if you want to apply different aggregation functions to different columns.

In [37]:
df.groupby(by=["variety"]).agg({"sepal_length": ["min", "max"], 
                               "petal_length": ["mean", "median"]})

| variety | sepal_length (min) | sepal_length (max) | petal_length (mean) | petal_length (median) |
|---|---|---|---|---|
| Setosa | 4.3 | 5.8 | 1.462 | 1.5 |
| Versicolor | 4.9 | 7.0 | 4.26 | 4.35 |
| Virginica | 4.9 | 7.9 | 5.552 | 5.55 |
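The dictionary form of `agg` produces two-level column headers. If flat column names are preferred, one alternative is pandas' named aggregation, sketched below on a small hypothetical frame; the output names `sepal_min`, `sepal_max` and `petal_mean` are arbitrary.

```python
import pandas as pd

# Hypothetical data for illustration
df_small = pd.DataFrame({
    "variety": ["Setosa", "Setosa", "Virginica", "Virginica"],
    "sepal_length": [5.0, 5.4, 6.3, 7.1],
    "petal_length": [1.4, 1.6, 5.0, 6.0],
})

# Each keyword is a new column name: (source column, aggregation function)
summary = df_small.groupby("variety").agg(
    sepal_min=("sepal_length", "min"),
    sepal_max=("sepal_length", "max"),
    petal_mean=("petal_length", "mean"),
)

print(summary)
```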


### 2.6 Defining new columns

You can create a new column from other columns.

In [38]:
df["length_ratio"] = df.sepal_length / df.petal_length
df.head()

|   | sepal_length | sepal_width | petal_length | petal_width | variety | length_ratio |
|---|---|---|---|---|---|---|
| 139 | 4.9 | 2.5 | 4.5 | 1.7 | Virginica | 1.088889 |
| 4 | 5.6 | 2.8 | 4.9 | 2.0 | Virginica | 1.142857 |
| 33 | 5.7 | 2.5 | 5.0 | 2.0 | Virginica | 1.14 |
| 43 | 5.8 | 2.8 | 5.1 | 2.4 | Virginica | 1.137255 |
| 83 | 5.8 | 2.7 | 5.1 | 1.9 | Virginica | 1.137255 |


More generally, you can use `apply` with `axis=1` to pass every row to a function.

Below, a function takes each `row` and computes the absolute difference between `sepal_width` and `petal_width`; we call the result `width_diff`. Note that this operation can be slow on large datasets, because the function is called once per row.

In [39]:
df["width_diff"] = df.apply(lambda row: abs(row.sepal_width-row.petal_width), axis=1)
df.head()

|   | sepal_length | sepal_width | petal_length | petal_width | variety | length_ratio | width_diff |
|---|---|---|---|---|---|---|---|
| 139 | 4.9 | 2.5 | 4.5 | 1.7 | Virginica | 1.088889 | 0.8 |
| 4 | 5.6 | 2.8 | 4.9 | 2.0 | Virginica | 1.142857 | 0.8 |
| 33 | 5.7 | 2.5 | 5.0 | 2.0 | Virginica | 1.14 | 0.5 |
| 43 | 5.8 | 2.8 | 5.1 | 2.4 | Virginica | 1.137255 | 0.4 |
| 83 | 5.8 | 2.7 | 5.1 | 1.9 | Virginica | 1.137255 | 0.8 |
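For a simple calculation like this one, the row-wise `apply` can usually be replaced by column arithmetic, which operates on whole columns at once. A minimal sketch on hypothetical data, comparing both forms:

```python
import pandas as pd

# Hypothetical data for illustration
df_small = pd.DataFrame({
    "sepal_width": [2.5, 2.8, 3.4],
    "petal_width": [1.7, 2.0, 0.2],
})

# Row-wise: the lambda is called once per row
row_wise = df_small.apply(
    lambda row: abs(row.sepal_width - row.petal_width), axis=1
)

# Vectorised: subtract the columns, then take the absolute value
vectorised = (df_small.sepal_width - df_small.petal_width).abs()

print(row_wise.tolist())
print(vectorised.tolist())
```

Both forms give the same values; `apply` remains useful when the per-row logic cannot be expressed as column operations.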


You may also cast an entire column using `astype`. Here we convert `sepal_length`, originally a float, to an int. Note that the fractional part is dropped rather than rounded, so 4.9 becomes 4.

In [40]:
df.sepal_length = df.sepal_length.astype(int)
df.head()

|   | sepal_length | sepal_width | petal_length | petal_width | variety | length_ratio | width_diff |
|---|---|---|---|---|---|---|---|
| 139 | 4 | 2.5 | 4.5 | 1.7 | Virginica | 1.088889 | 0.8 |
| 4 | 5 | 2.8 | 4.9 | 2.0 | Virginica | 1.142857 | 0.8 |
| 33 | 5 | 2.5 | 5.0 | 2.0 | Virginica | 1.14 | 0.5 |
| 43 | 5 | 2.8 | 5.1 | 2.4 | Virginica | 1.137255 | 0.4 |
| 83 | 5 | 2.7 | 5.1 | 1.9 | Virginica | 1.137255 | 0.8 |
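Casting a float column to int discards the fractional part. If the nearest integer is wanted instead, one option is to call `round` before `astype`, as in this sketch on a short hand-written `Series` (the names `lengths`, `truncated` and `rounded` are illustrative).

```python
import pandas as pd

lengths = pd.Series([4.9, 5.6, 5.2])

truncated = lengths.astype(int)         # fractional part dropped
rounded = lengths.round().astype(int)   # rounded to nearest first

print(truncated.tolist())   # [4, 5, 5]
print(rounded.tolist())     # [5, 6, 5]
```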
