# Python coding bootcamp - Notebook 2

- Introduction to **NumPy**
- Introduction to **pandas**

&copy; 2024 Francis WOLINSKI

<div class="alert alert-danger">
    <h3><i class="fa fa-plus-square"></i>  PART 1</h3>
</div>

![image](images/numpylogo.png)

# Introduction to NumPy

1. **NumPy**, the origins of data processing in Python
2. **NumPy** and arrays
3. Accessing and modifying values
4. Broadcasting
5. Universal functions
6. Examples with images
7. Saving and loading arrays

# 1. NumPy, the origins of data processing in Python

- **NumPy** is the first data processing package in Python.
- It is based on a set of functions coded in the C language
- It combines a class, ndarray, and universal functions
- The foundation of most data science packages

**NumPy** uses arrays:
- a one-dimensional array is a vector
- a two-dimensional array is a matrix
- more generally, an n-dimensional array is equivalent to a tensor

Arrays are used to work on unstructured data
- an image is represented by a 3-dimensional array
- a video is represented by a 4-dimensional array

In [None]:
# import

import numpy as np

# 2. NumPy and arrays

- Arrays are central structures in data science. 
- **NumPy** arrays are used in the same way as vectors or matrices.
- Arrays are created using the function `np.array()`, and can be created from a list or several lists.
- In its classic form, an array holds elements of a single data type.

### A little vocabulary

- Dimensions are called *axes* (`axis=0`: rows, `axis=1`: columns)
- The number of dimensions is accessed with `.ndim`.
- The shape is accessed with `.shape`.
- Size (`.size`) is the total number of elements in an array
- Element type is accessed with `.dtype`.

Arrays can be created in a number of ways:

function (extract)|use
-|-
array|creates an *ndarray* from an array-like object (e.g. a list or nested lists)
arange|vector of evenly spaced numbers in an interval, given a step
linspace|vector of evenly spaced numbers in an interval, given the number of values
zeros|returns an *ndarray* filled with zeros
zeros_like|returns an *ndarray* filled with zeros, with the same dimensions as another *ndarray*
ones|returns an *ndarray* filled with ones
ones_like|returns an *ndarray* filled with ones, with the same dimensions as another *ndarray*
eye|returns a matrix of zeros with ones on a diagonal (the main diagonal by default)
identity|returns an identity matrix
full|returns an *ndarray* filled with a given value
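As an illustration of the creation functions listed in the table, here is a minimal sketch. The variable names and the chosen shapes are only assumptions made for the example.

```python
import numpy as np

# Evenly spaced values: arange takes a step, linspace takes a number of values.
by_step = np.arange(0, 10, 2)        # 0, 2, 4, 6, 8 (the stop value is excluded)
by_count = np.linspace(0, 1, 5)      # 0.0, 0.25, 0.5, 0.75, 1.0 (the stop value is included)

# Arrays filled with a constant value.
zeros = np.zeros((2, 3))             # 2 x 3 array of 0.0
ones = np.ones_like(zeros)           # same shape as `zeros`, filled with 1.0
sevens = np.full((2, 2), 7)          # 2 x 2 array filled with 7

# Square matrices with ones on the main diagonal.
eye3 = np.eye(3)
ident3 = np.identity(3)

print(by_step, by_count, sevens, eye3, sep="\n")
```

Note that `np.eye()` also accepts a non-square shape and a diagonal offset, whereas `np.identity()` always returns a square identity matrix.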

**Explicit creation**

In [None]:
# an array

arr1 = np.array([[1, 2, 3], [4, 5, 6]])
print(arr1)

In [None]:
# an array

print(arr1.ndim, arr1.shape, arr1.size, arr1.dtype)

**np.arange function**

In [None]:
# arange

array = np.arange(48)
array

In [None]:
# ndim

array.ndim

In [None]:
# shape

array.shape

The `reshape` method returns an array with a new structure and the same number of elements; the original array is left unchanged.

Such operations, known as *reshaping*, are very important in Data Science, as we'll see with the **pandas** library.

In [None]:
# reshape

array.reshape(8, 6)
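The behavior of `reshape` can be sketched on a smaller array. The names `flat`, `grid` and `inferred` are assumptions chosen for this example only.

```python
import numpy as np

flat = np.arange(12)

# reshape returns a new array object; `flat` keeps its original shape.
grid = flat.reshape(3, 4)
print(flat.shape, grid.shape)

# -1 asks NumPy to infer that dimension from the total number of elements.
inferred = flat.reshape(2, -1)
print(inferred.shape)

# A shape that does not match the number of elements raises a ValueError.
try:
    flat.reshape(5, 3)
except ValueError as err:
    print("reshape failed:", err)
```

This is why a later cell has to reassign the result (`array = array.reshape(8, 6)`) to keep the new shape.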

<div class="alert alert-success">
    <h3><i class="fa fa-edit"></i> Exercise 1 &starf;</h3>
    <ul>
        <li>Choose another dimension for this <code>ndarray</code> and test attributes such as <code>ndim</code>, <code>shape</code>, <code>size</code>.</li>
        </ul>
</div>

In [None]:
# %load notebook2/ex_01.py

**np.linspace function**

The `np.linspace` function is useful for plotting functions, for example. It generates a given number of equidistant values between two bounds.

In [None]:
# linspace: 11 values between 0 and 5

np.linspace(0, 5, 11)

# 3. Accessing, selecting and modifying values

In [None]:
array = array.reshape(8, 6)
array

In [None]:
# access the first row

array[0]

In [None]:
# access to the first column

array[:, 0]

In [None]:
# access to a sub-matrix

array[2:5, 2:4]

In [None]:
# selecting

array[array > 10]

In [None]:
# modification of a sub-matrix

array[2:5, 2:4] = -1
array
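A point worth keeping in mind when modifying values: basic slicing returns a view on the same data, whereas `copy()` creates independent data. The sketch below uses a small hypothetical array to show the difference.

```python
import numpy as np

original = np.arange(12).reshape(3, 4)

# Basic slicing returns a view: writing to it also changes `original`.
view = original[0:2, 0:2]
view[:] = -1

# copy() creates independent data: changing the copy leaves `original` untouched.
independent = original.copy()
independent[2, :] = 99

# Assignment through a boolean mask modifies the selected elements in place.
original[original > 8] = 0

print(original)
print(independent)
```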

# 4. Broadcasting

An array supports arithmetic operations with a scalar and operations with another array.

Basic arithmetic operations are performed term by term: `*`, `+`, `-`, `/`, `**` (power), `%` (modulo).

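For operations with another array of a different shape, NumPy *broadcasts* the smaller array across the larger one, provided their shapes are compatible: dimensions are compared starting from the last one, and two dimensions are compatible when they are equal or when one of them is 1.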
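The shape compatibility rule can be sketched with three small cases. The arrays below are hypothetical and chosen only to show one row-wise broadcast, one column-wise broadcast and one incompatible pair.

```python
import numpy as np

matrix = np.arange(8).reshape(2, 4)      # shape (2, 4)
row = np.array([10, 20, 30, 40])         # shape (4,)
column = np.array([[100], [200]])        # shape (2, 1)

# (2, 4) + (4,): the row is reused for each of the 2 rows.
by_row = matrix + row
print(by_row)

# (2, 4) + (2, 1): the column is reused for each of the 4 columns.
by_column = matrix + column
print(by_column)

# (2, 4) + (3,): the last dimensions (4 and 3) differ and neither is 1.
try:
    matrix + np.arange(3)
except ValueError as err:
    print("not broadcastable:", err)
```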

In [None]:
# vector 1

array1 = np.arange(4)
array1

In [None]:
# addition with a scalar

array2 = array1 + 10
array2

In [None]:
# multiplication with another vector

array1 * array2

In [None]:
# 2 x 4 matrix

array3 = np.arange(8).reshape(2, 4)
array3

In [None]:
# addition matrix + vector (broadcast)

array3 + array2

# 5. Universal functions

**NumPy** provides a number of functions for manipulating arrays:
- logical functions: `np.all()`, `np.any()`, `np.where(condition, A, B)` for simple conditions
- mathematical functions: `np.abs()`, `np.sqrt()`, `np.sin()`, `np.cos()`, `np.tan()`, `np.log()`, `np.exp()`, `np.floor()`
- aggregation and sorting functions: `np.sum()`, `np.cumsum()`, `np.min()`, `np.max()`, `np.sort()`, `np.argsort()`
- statistical functions: `np.mean()`, `np.std()`, `np.var()`, `np.median()`, `np.percentile()`, `np.average()` (weighted average)
- matrix calculations: `@` or `.dot()`, `T` or `.transpose()`.
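A few of the functions listed above, sketched on small hypothetical arrays (the values are assumptions chosen so the results are easy to check by hand):

```python
import numpy as np

values = np.array([-2, -1, 0, 1, 2])

# Logical reductions and element-wise selection.
has_negative = np.any(values < 0)                 # at least one negative value
all_small = np.all(np.abs(values) <= 2)           # every absolute value is <= 2
clipped = np.where(values < 0, 0, values)         # negatives replaced by 0

# Weighted average: weights are given in the same order as the values.
weighted = np.average([10, 20], weights=[3, 1])   # (10*3 + 20*1) / 4 = 12.5

# Matrix product and transpose.
m = np.array([[1, 2], [3, 4]])
product = m @ m.T

print(has_negative, all_small, clipped, weighted)
print(product)
```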

Some of these functions can be used with the keyword `axis` to specify the dimension in which the reduction is to be performed.
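The effect of the `axis` keyword is easier to see on a small array. This is a minimal sketch with a hypothetical 2 x 3 table:

```python
import numpy as np

table = np.array([[1, 2, 3],
                  [4, 5, 6]])     # 2 rows, 3 columns

col_totals = table.sum(axis=0)   # collapses the rows: one total per column
row_totals = table.sum(axis=1)   # collapses the columns: one total per row
grand_total = table.sum()        # no axis: reduces over all elements

print(col_totals)   # [5 7 9]
print(row_totals)   # [ 6 15]
print(grand_total)  # 21
```

In other words, `axis` names the dimension that disappears in the result.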

In [None]:
# array

array

In [None]:
# sum along axis 0 (down the rows): one total per column

array.sum(axis=0)

In [None]:
# sum along axis 1 (across the columns): one total per row

array.sum(axis=1)

In [None]:
# total sum

array.sum()

# 6. Examples with images

If you load an image with the pyplot module of the **matplotlib** library, you get an array on which you can perform manipulations.

In [None]:
# import

import matplotlib.pyplot as plt

## 6.1 Paris image

In [None]:
# Paris

image_paris = plt.imread("./images/tour-eiffel.png")
plt.imshow(image_paris);

The `image_paris` object is an array.

In [None]:
# type

type(image_paris)

The `image_paris` object is a 3-dimensional array.

In [None]:
# ndim

image_paris.ndim

In [None]:
# shape

image_paris.shape

The 3 dimensions represent:
- image height in pixels, here 694
- image width in pixels, here 1024
- the 3 primary colors: red, green, blue (RGB):
    - `image_paris[:,:,0]` represents primary color values <span style="color:red;font-weight:bold;">red</span>,
    - `image_paris[:,:,1]` represents primary color values <span style="color:green;font-weight:bold;">green</span>,
    - `image_paris[:,:,2]` represents primary color values <span style="color:blue;font-weight:bold;">blue</span>.

In [None]:
# top left pixel

image_paris[0, 0]

For images, values are either floating-point numbers between 0.0 and 1.0, or integers between 0 and 255 (corresponding to 00 and FF in hexadecimal).

To display a single primary color channel of an image, simply set the values of the other 2 channels to 0.
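Both points can be sketched on a tiny synthetic image. The 2 x 2 array below is a hypothetical stand-in for a loaded picture; the result could be passed to `plt.imshow()` in the same way as a real image.

```python
import numpy as np

# Hypothetical 2 x 2 RGB image with float values between 0.0 and 1.0:
# red and yellow on the first row, blue and white on the second row.
image = np.array([[[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]],
                  [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]])

# Keep only the red channel: work on a copy and zero green (1) and blue (2).
red_only = image.copy()
red_only[:, :, 1:] = 0.0

# Convert the float representation to integers between 0 and 255.
as_uint8 = (image * 255).round().astype(np.uint8)

print(red_only[0, 1])   # the yellow pixel keeps only its red component
print(as_uint8[1, 1])   # the white pixel in integer form
```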

In [None]:
# color of a single pixel

arr = np.full((1, 1, 3), image_paris[0, 0])
plt.imshow(arr);

<div class="alert alert-info">
    <h3><i class="fa fa-info-circle"></i> Tips</h3>
    <p><b>matplotlib</b> functions always return an object. A semicolon <code>;</code> added at the end of a <b>matplotlib</b> statement prevents that object from being printed (only in notebooks).</p>
</div>

<div class="alert alert-success">
    <h3><i class="fa fa-edit"></i> Exercise 2 &starf;</h3>
    <ul>
        <li>Display the bottom right pixel.</li>
        </ul>
</div>

In [None]:
# %load notebook2/ex_02.py

<div class="alert alert-success">
    <h3><i class="fa fa-edit"></i> Exercise 3 &starf;&starf;</h3>
    <ul>
        <li>Extract the Eiffel Tower from the image</li>
        <li>Hint: look at the coordinates of the picture and collect a sub-matrix.</li>
    </ul>
</div>

In [None]:
# %load notebook2/ex_03.py

## 6.2 Mondrian image

In [None]:
# Mondrian

image_mondrian = plt.imread("./images/mondrian-1504681_1280.png")
plt.imshow(image_mondrian);

In [None]:
# shape

image_mondrian.shape

<div class="alert alert-success">
    <h3><i class="fa fa-edit"></i> Exercise 4 &starf;&starf;</h3>
    <ul>
        <li>Using the <code>copy()</code> method each time, display the image corresponding to each primary color: <span style='color:blue;font-weight:bold;'>blue</span>, <span style='color:green;font-weight:bold;'>green</span> and <span style='color:red;font-weight:bold;'>red</span>, in that order.</li>
        <li>Each time, you will need to set the other channels to 0.</li>
        <li>Visual check:
            <ul>
                <li>blue = blue or white in the initial image</li>
                <li>green = yellow or white in the initial image</li>
                <li>red = red, yellow or white in the initial image</li>
            </ul>
        </li>
        <li><em>Nota bene</em> yellow = red + green and white = red + green + blue in additive synthesis.</li>
        </ul>
</div>

In [None]:
# %load notebook2/ex_04.py

<div class="alert alert-success">
    <h3><i class="fa fa-edit"></i> Exercise 5 &starf;&starf;</h3>
    <p>Switch the image to approximate grayscale:</p>
    <ol>
        <li>Copy the image to a new variable.</li>
        <li>Calculate the average, minimum (darker) or maximum (lighter) of the 3 primary colors on the third dimension (<code>axis=2</code>).</li>
        <li>Assign calculated values to the three levels of red, green and blue.</li>
        <li>Display the new approximate grayscale images.</li>
    </ol>
</div>

In [None]:
# %load notebook2/ex_05.py

In practice, grayscale conversion usually relies on a weighted average of the three channels: $Y=0.2989 \times R + 0.5870 \times G + 0.1140 \times B$

<div class="alert alert-success">
    <h3><i class="fa fa-edit"></i> Exercise 6 &starf;&starf;&starf;</h3>
    <p>Switch the image to grayscale by using the appropriate formula.</p>
</div>

In [None]:
# %load notebook2/ex_06.py

## 6.3 Other images

<div class="alert alert-success">
    <h3><i class="fa fa-edit"></i> Exercise 7 &starf;&starf;</h3>
    <ul>
        <li>Load image "./images/guess1.jpg" (resp. "./images/guess2.jpg") and display it.</li>
        <li>In this picture green and blue pixels (resp. red and green) have been randomly changed.</li>
        <li>Apply a simple transformation to improve the picture and guess what it represents.</li>
        </ul>
</div>

In [None]:
# %load notebook2/ex_07.py

<div class="alert alert-success">
    <h3><i class="fa fa-edit"></i> Exercise 8 &starf;</h3>
    <ul>
        <li>Collect a picture of your own and try to apply some of the methods we have studied.</li>
        </ul>
</div>

# 7. Saving and loading arrays

We use:
- `np.save('my_array', my_array)` to save an array (the `.npy` extension is added automatically)
- `np.load('my_array.npy')` to load an array
- `np.savez('ziparray.npz', x=my_array, y=my_array2)` to save several arrays in a single `.npz` archive
- `np.savetxt('textfile.txt', my_array, delimiter=';')` to save an array in a text file
- `np.loadtxt('textfile.txt', delimiter=';')` to load an array from a text file
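The functions above can be sketched as a round trip. The file names are hypothetical and a temporary folder is used so that nothing is left on disk.

```python
import os
import tempfile

import numpy as np

my_array = np.arange(6).reshape(2, 3)
my_array2 = np.linspace(0, 1, 5)

with tempfile.TemporaryDirectory() as folder:
    # Binary format: np.save appends the .npy extension if it is missing.
    np.save(os.path.join(folder, "my_array"), my_array)
    loaded = np.load(os.path.join(folder, "my_array.npy"))

    # Several arrays in one .npz archive, retrieved by keyword name.
    archive_path = os.path.join(folder, "arrays.npz")
    np.savez(archive_path, x=my_array, y=my_array2)
    with np.load(archive_path) as archive:
        x, y = archive["x"], archive["y"]

    # Text format: loadtxt returns floats by default.
    text_path = os.path.join(folder, "textfile.txt")
    np.savetxt(text_path, my_array, delimiter=";")
    from_text = np.loadtxt(text_path, delimiter=";")

print(loaded, y, from_text, sep="\n")
```

The binary formats preserve the element type, whereas the text round trip turns the integers into floats.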

<div class="alert alert-warning">
    <h3><i class="fa fa-book"></i> Documentation</h3>
    <ul>
        <li><strong>NumPy</strong> : <a href="https://numpy.org/doc/stable/index.html">https://numpy.org/doc/stable/index.html</a></li>
    </ul>
</div>

<div class="alert alert-danger">
    <h3><i class="fa fa-plus-square"></i>  PART 2</h3>
</div>

![image](./images/pandas2.png)

# 1. Introduction to pandas

1. Series and DataFrame
2. Accessing data

In [None]:
# imports

import lxml  # for read_html()
import pandas as pd  # import pandas and name the module 'pd'

# display options
pd.set_option("display.min_rows", 16)
pd.set_option("display.max_columns", 30)

## 1.1 Presentation of the dataset

This session uses the data available on the US Social Security web site: https://www.ssa.gov/oact/babynames/limits.html

We will use the national dataset `names.zip`.

### Read me provided by SSA:

> For each year of birth YYYY after 1879, we created a comma-delimited file called yobYYYY.txt. Each record in the individual annual files has the format "name,sex,number", where name is 2 to 15 characters, sex is M (male) or F (female) and "number" is the number of occurrences of the name. Each file is sorted first on sex and then on number of occurrences in descending order. When there is a tie on the number of occurrences, names are listed in alphabetical order. (...)
> To safeguard privacy, we restrict our list of names to those with at least 5 occurrences.

## 1.2 Loading a single file

**pandas** provides the `read_csv()` function to load CSV files.

To understand how to use the `read_csv()` function, we can display its documentation with the `?` operator.

In [None]:
# pd.read_csv?

Reading a file with **pandas** is performed with a single statement and the appropriate arguments, here:
- file path
- column names
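To try these two arguments without the real file, here is a minimal sketch that reads a hypothetical three-line sample in the same "name,sex,number" format from memory. The sample values are invented for the example.

```python
import io

import pandas as pd

# Hypothetical sample in the "name,sex,number" format, without a header row.
raw = "Olivia,F,100\nEmma,F,90\nLiam,M,120\n"

# names= supplies the column labels because the data has no header line.
sample = pd.read_csv(io.StringIO(raw), names=["name", "gender", "births"])

print(sample)
print(sample.shape)
```

Without the `names` argument, the first data line would be used as the header.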

In [None]:
# loading the 2023 file

df2023 = pd.read_csv('data/names/yob2023.txt', names=['name', 'gender', 'births'])
df2023

The result is a `DataFrame` object which is assigned to a variable.
This `DataFrame` object has 31,915 rows and 3 columns.

Many other functions import data from other formats. Type `pd.read` and then press the `Tab` key: Jupyter displays a list of available functions starting with `read`. Select the appropriate function with the up and down arrows, or press `Esc` to exit the list.

In [None]:
# pd.read # put the cursor just after pd.read, press Tab key and Esc to exit from the list

function (extract) | usage
-|-
read_clipboard | read data from the clipboard
read_csv | read data from a CSV file (Comma-Separated Values)
read_excel | read data from an Excel workbook
read_html | read data from an HTML file (searches for `<table>` tags)
read_json | read data from a JSON file (JavaScript Object Notation)
read_pickle | read data from a pickled object
read_sql | read data from an SQL query or table
read_table | read data from general delimited file

All of these functions accept a large number of options, through keyword arguments with default values, which adapt the behavior of the function to the actual data to be processed.

<div class="alert alert-warning">
    <h3><i class="fa fa-book"></i> Further reading</h3>
    <p>For most <code>read_xxx()</code> functions, a matching <code>to_xxx()</code> method exists and exports the <code>DataFrame</code> object to the corresponding format, see: <a href='https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html'>IO tools</a></p>
</div>
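The pairing between a `read_xxx()` function and a `to_xxx()` method can be sketched with CSV, using a small hypothetical `DataFrame` and an in-memory string instead of a file:

```python
import io

import pandas as pd

frame = pd.DataFrame({"name": ["Emma", "Noah"], "births": [90, 80]})

# to_csv() with no path returns the CSV content as a string.
csv_text = frame.to_csv(index=False)

# read_csv() rebuilds a DataFrame with the same content from that text.
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored)
```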

## 1.3 Webscraping HTML tables

The pandas `read_html()` function scrapes HTML tables from a web page. It returns a list of `DataFrame` objects, one for each table found in the page. By default, this function requires the **lxml** package to be installed.

We can try with the geonames Country Codes page: https://www.geonames.org/countries/

In [None]:
# webscraping HTML tables
# var = pd.read_html('https://www.geonames.org/countries/')
var = pd.read_html('data/GeoNames.html')

# get the shape of each collected DataFrame
[df.shape for df in var]

In [None]:
# access to a specific DataFrame of interest
var[1]

## 1.4 Loading all files

We are now going to import the data from all files rather than from a single one.

This is done with a small script that is out of scope for the moment. It loops over all files in order to build a large `DataFrame` which includes a fourth column holding the year.

We are only interested in the result of the script.

In [None]:
# Import all data in a single DataFrame

dfs = []  # list of DataFrames for each year

for year in range(1880, 2024):
    filename = f'data/names/yob{year}.txt'  # build the filename from the year
    csv = pd.read_csv(filename, names=['name', 'gender', 'births'])  # load the file as a new DataFrame
    csv['year'] = year  # add a column with the year to the new DataFrame
    dfs.append(csv)  # append the DataFrame to the list

df = pd.concat(dfs, ignore_index=True)  # build a single DataFrame from all DataFrames
df = df[['year', 'name', 'gender', 'births']]  # reorder columns of the global DataFrame
df.to_pickle('names.pkl')  # save the DataFrame to the pickle format

In [None]:
# the whole data
df

The result is a large `DataFrame` with 2,117,219 rows and 4 columns. Each column can be accessed individually; a single column is called a `Series`.

<div class="alert alert-warning" role="alert">
    <h3><i class="fa fa-question-circle"></i> Question &starf;</h3>
    <p>Work out by hand how many years this dataset covers.</p>
</div>

# 2. Series and DataFrame

The **pandas** name is derived from *panel data*, an econometrics term for multidimensional datasets. It is also often read as **PAN**(el) + **DA**(taframe) + **S**(eries), where panels, dataframes and series are the 3-D, 2-D and 1-D objects. Panels have since been removed from the library and are now replaced by multi-index dataframes.

## 2.1 Series, 1-D objects

Each column of a `DataFrame` can be accessed using the square bracket operator `[]` and the label of the column.

The resulting object is an instance of `Series`. It is a 1-dimensional structure along with an index.

This **pandas** class relies on the `numpy.ndarray` structure. Most methods available for `ndarray` are also available for `Series`.

The `Series` object holds an index that is identical to that of the `DataFrame`, and its name is the label of the column of the `DataFrame` object. Depending on the **pandas** version and settings, it may be a view of the `DataFrame`, in which case modifications of the `Series` object are propagated to the `DataFrame` object; with Copy-on-Write (the default from pandas 3.0), they are not. Use `copy()` whenever an independent object is needed.
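The relationship between a column and its `DataFrame` can be sketched on a small hypothetical table. The example deliberately modifies an explicit copy only, since that behaves the same way whatever the **pandas** version.

```python
import pandas as pd

frame = pd.DataFrame({"name": ["Emma", "Noah", "Mia"], "births": [90, 80, 70]})

# Selecting a column returns a Series named after the column...
births = frame["births"]
print(type(births).__name__, births.name)

# ...that shares the row index of the DataFrame.
print(births.index.equals(frame.index))

# An explicit copy is independent of the DataFrame.
safe = frame["births"].copy()
safe.iloc[0] = 0
print(frame["births"].iloc[0])   # the DataFrame still holds the original value
```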

### 2.1.1 Common Series attributes

attribute|result
-|-
s.name|name of s
s.values|values of s
s.index|index of s
s.shape|dimension of s
s.size|number of elements of s
s.dtype|type of elements of s
s.empty|True if s is empty, False otherwise

In [None]:
# Series from the 'births' column
s = df["births"]
s

In [None]:
# type of object s
type(s)

In [None]:
# 'name' attribute of s is the label of the column
s.name

In [None]:
# 'values' attribute of s
# it is a 1-D numpy ndarray
s.values

In [None]:
type(s.values)

In [None]:
# 'index' attribute of s
# cf. range type of Python
s.index

In [None]:
# 'shape' attribute of s
s.shape

In [None]:
# 'size' attribute of s
s.size

In [None]:
# 'dtype' attribute of s
s.dtype

### 2.1.2 Common Series methods

`Series` objects have several common methods:

method|result
-|-
s.head()|first elements of s
s.tail()|last elements of s
s.nunique()|number of unique elements of s
s.value_counts()|number of occurrences of unique elements of s
s.unique()|ndarray with unique elements of s

The `Series` class defines many attributes or methods in the `pandas` module. We will explore some of them.

In [None]:
# first elements, by default 5
s.head(10)

In [None]:
# last elements, by default 5
s.tail()

In [None]:
# number of unique values
df['name'].nunique()

In [None]:
# number of unique values
df['gender'].nunique()

In [None]:
# number of unique values
df['year'].nunique()

In [None]:
# value counts
df['births'].value_counts()

**Observation**

The object returned by the `value_counts()` method is also a `Series` object. In **pandas**, many operations on a `Series` or  a `DataFrame` object return another `Series` or `DataFrame` object. This design principle contributes to the power of the library.

By default, the resulting `Series` object is sorted by count in descending order, so the most frequent value comes first.
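A minimal sketch of this behavior on a small hypothetical `Series` of letters:

```python
import pandas as pd

letters = pd.Series(["a", "b", "a", "c", "a", "b"])

# The unique values become the index, the counts become the values.
counts = letters.value_counts()   # a: 3, b: 2, c: 1 (most frequent first)

print(counts)
print(counts.index[0], counts.iloc[0])        # most frequent value and its count
print(letters.value_counts(ascending=True))   # least frequent first
```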

In [None]:
# value counts
s2 = df['births'].value_counts()
type(s2)

In [None]:
# 'index' attribute of s2
s2.index

In [None]:
# to get the most frequent value
s2.index[0]

In [None]:
# 'values' attribute of s2
s2.values

In [None]:
# to get the count of the most frequent value
s2.values[0]

<div class="alert alert-success">
<h3><i class="fa fa-edit"></i>  Exercise 9 &starf;</h3>
<ul>
<li>What are the value counts of genders? How can we interpret the result?</li>
<li>What are the top 16 value counts of names? How can we interpret the result?</li>
<li>What are the 10 years for which we have the most distinct name + gender combinations? What is the finding?</li>
</ul>
</div>

In [None]:
# %load notebook2/ex_09.py

### 2.1.3 Vectorial operations with Series

All vectorial operations for `numpy.ndarray` are available for `Series` objects:

- logical:
    - element-wise between 2 `Series` objects: `&` (AND), `|` (OR), `~` (NOT)
    - within values of a single `Series` object: `any()` (OR), `all()` (AND)
- usual mathematical functions: e.g.,  `abs()`, `sqrt()`, `sign()`, `floor()`, `rint()`
- vectorial computation with a scalar value, a list of values with the `isin()` method, or another `Series` object
- comparison with a scalar value or another `Series` object
- usual statistical functions: e.g., `sum()`, `min()`, `max()`, `mean()`, `median()`, `std()`, `var()`, `cumsum()`, `cumprod()`, `cummin()`, `cummax()`, `idxmin()`, `idxmax()`
- advanced mathematical functions: e.g., trigonometric, logarithm, exponential
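Several of these operations can be sketched on a short hypothetical `Series`, small enough to check every result by hand:

```python
import pandas as pd

births = pd.Series([5, 12, 7, 30], name="births")

doubled = births * 2                        # arithmetic with a scalar
is_large = births > 10                      # comparison returns a boolean Series
in_list = births.isin([5, 7])               # membership test against a list
both = is_large & ~in_list                  # element-wise AND / NOT
any_large = is_large.any()                  # OR over all values
running = births.cumsum()                   # cumulative sum
label_of_max = births.idxmax()              # index label of the largest value

print(doubled.tolist(), both.tolist(), any_large)
print(running.tolist(), label_of_max)
```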

In [None]:
# series of births
s = df["births"]
s

In [None]:
# sum of s
s.sum()

In [None]:
# min of s
s.min()

In [None]:
# max of s
s.max()

In [None]:
# s times 3
s * 3

In [None]:
# s equals 5
s == 5

In [None]:
# s is in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]?
s.isin(range(10))

In [None]:
# are there rows in which the number of births equals the year?
(df["births"] == df["year"]).any()

In [None]:
# cumsum of s
s.cumsum()

### 2.1.4 Operations on Series objects containing strings

The `str` accessor makes it possible to work on `Series` objects holding strings in order to get a new `Series` object.

Thanks to the `str` accessor, most Python operators and functions dedicated to strings are available for `Series` objects holding strings: `[]`, `len()`, `startswith()`, `contains()`, `endswith()`, `split()`, `lower()`, `upper()`, `capitalize()`, `title()`, ...

These methods return a new `Series` object containing the result of the function applied to each element.
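A few of these string operations, sketched on three hypothetical names:

```python
import pandas as pd

names = pd.Series(["Emma", "Noah", "Olivia"])

lengths = names.str.len()                 # 4, 4, 6
upper = names.str.upper()                 # EMMA, NOAH, OLIVIA
initials = names.str[0]                   # E, N, O
ends_with_a = names.str.endswith("a")     # True, False, True
has_li = names.str.contains("li")         # False, False, True

print(lengths.tolist(), initials.tolist(), ends_with_a.tolist())
```

Each result is a new `Series` with the same index as `names`, so it can be used directly as a boolean mask or combined with other columns.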

In [None]:
# length of the first 5 names
df["name"].str.len().head()

**Observation**: starting with `head()` would have been more efficient, since the length is only computed on the 5 first names.

In [None]:
# length of the first 5 names
df["name"].head().str.len()

<div class="alert alert-info">
<h3><i class="fa fa-info-circle"></i> Jupyter hint</h3>
    <p>We can check it with the magic command <code>%timeit</code>, which measures the execution time of a statement.</p>
</div>

In [None]:
%timeit df["name"].str.len().head()

In [None]:
%timeit df["name"].head().str.len()

<div class="alert alert-warning" role="alert">
    <h3><i class="fa fa-question-circle"></i> Online question &starf; &starf;</h3>
    <p>Which one is faster? Can you guess why?</p>
    <ol>
        <li><b>Python</b>: <code>sum(list(range(10000000)))</code></li>
        <li><b>Numpy</b>: <code>np.arange(10000000).sum()</code> (import numpy as np first)</li>
    </ol>
</div>

<div class="alert alert-info">
<h3><i class="fa fa-info-circle"></i> Magic commands in Jupyter</h3><br />
    A magic command in Jupyter starts with a <code>%</code>.<br />
    There are many magic commands; here are a few of them:
    <ul>
        <li><code>%lsmagic</code>: list all magic commands</li>
        <li><code>%who</code> and <code>%whos</code>: list all variables of global scope</li>
        <li><code>%load</code>: load a script from a file</li>
        <li><code>%timeit</code>: uses the Python timeit module; it runs the statement in a loop (the number of loops is chosen automatically), repeats the measurement several times (7 by default), and reports the mean and standard deviation of the run times.</li>
        <li><code>%%time</code>: will give you information about a single run of the code in your cell</li>
    </ul>
    <p>Do not confuse magic commands with the <code>%</code> operator (remainder of the division of 2 numbers).</p>
</div>
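Outside a notebook, magic commands are not available, but the standard `timeit` module offers comparable measurements. The following is only a sketch: the statement being timed and the values of `number` and `repeat` are assumptions chosen to keep the example fast.

```python
import timeit

n = 100_000

def build_squares():
    """Statement to measure: build a list of n squares and return its length."""
    return len([i * i for i in range(n)])

# timeit.repeat returns one total duration (in seconds) per run;
# each run calls the function `number` times.
durations = timeit.repeat(build_squares, number=10, repeat=3)

# The minimum is usually the run least disturbed by other activity on the machine.
best_per_call = min(durations) / 10
print(f"best of 3 runs: {best_per_call:.6f} s per call")
```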

In [None]:
# all magic
%lsmagic

In [None]:
# all variables
%whos

In [None]:
%%time
# compute the time for the whole cell
len([i * i for i in range(10000000)])

<div class="alert alert-success">
<h3><i class="fa fa-edit"></i> Exercise 10 &starf;</h3>
<ul>
<li>Compute the minimum and the maximum length of names</li>
<li>Give the value counts of names length</li>
</ul>
</div>

In [None]:
# %load notebook2/ex_10.py

### 2.1.5 Creating a Series object manually

In [None]:
s = pd.Series([12, 25, 27, 32], name="age")
s

## 2.2 DataFrame, 2-D objects

A `DataFrame` object is a table of data of the kind frequently encountered in information systems: CSV or TSV file, Excel sheet, R data.frame, database table...

Generally, a table is composed of a number of homogeneous records or rows. Each row represents a specific record and the columns stand for the features of each record.

The objects that may be included in a `DataFrame` might be of any kind:
- booleans: `bool`
- integers: `int64`
- floating number: `float64`
- date objects: `datetime64` (individual values are `Timestamp` objects)
- strings and any objects: `object` (the more general object)


*Nota bene*

> **pandas** tries to use the most suitable type for the data in each column.
> If the type of a column is `object`, it may contain objects other than strings.
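Type inference can be sketched with a small hypothetical `DataFrame` holding one column per kind of data. The exact name displayed for the string and date columns may vary with the **pandas** version, so the example only prints what is inferred.

```python
import pandas as pd

frame = pd.DataFrame({
    "is_member": [True, False],
    "age": [12, 25],
    "height": [1.52, 1.80],
    "joined": pd.to_datetime(["2024-01-15", "2024-02-01"]),
    "name": ["Emma", "Lucas"],
})

# One dtype per column, inferred from the data.
print(frame.dtypes)

# A column mixing strings and numbers falls back to the generic object dtype.
mixed = pd.Series(["Emma", 12, 1.5])
print(mixed.dtype)
```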

### 2.2.1 Common DataFrame attributes

attribute|result
-|-
df.shape|dimensions of df
df.size|number of elements of df
df.values|values of df
df.index|index of df
df.index.is_unique|if index of df is unique
df.columns|columns of df
df.dtypes|types of columns of df
df.empty|True if df is empty, False otherwise

In [None]:
# type of object df
type(df)

In [None]:
# dimensions of df
df.shape

In [None]:
# number of rows of df
len(df)

In [None]:
# total number of elements of df
df.size

In [None]:
# values of df
# it is a 2-D numpy ndarray
df.values

In [None]:
# first row of values of df
# numpy accessor
df.values[0]

In [None]:
# first column of values of df
# numpy accessor
df.values[:, 0]

In [None]:
# index of df
df.index

In [None]:
# index of df is unique
df.index.is_unique

In [None]:
# columns of df
df.columns

In [None]:
# dtypes of columns of df
df.dtypes

### 2.2.2 Common DataFrame methods

`DataFrame` objects have several common methods:

method|result
-|-
df.head()|first rows of df
df.tail()|last rows of df
df.info()|information on df
df.count()|number of non NA elements in each column
df.nunique()|number of unique elements in each column
df.transpose() or df.T|transposition of df

The `DataFrame` class defines many attributes or methods in the `pandas` module. We will explore some of them.

In [None]:
# first rows, by default 5
df.head()

In [None]:
# last rows, by default 5
df.tail()

In [None]:
# information on df
df.info()

In [None]:
# count
df.count()

In [None]:
# nunique
df.nunique()

In [None]:
# transposition of df
df_t = df.head(100).T # or df.head(100).transpose()
df_t

In [None]:
# index of a transposed df is its columns
df_t.index

In [None]:
# columns of a transposed df are its index
df_t.columns

In [None]:
# first rows displayed as columns
df.head(3).T

<div class="alert alert-info">
<h3><i class="fa fa-info-circle"></i> Attribute notation for <code>DataFrame</code> columns</h3>
<p>When the label of a column does not contain spaces or other special characters, and does not clash with a <code>DataFrame</code> attribute or method, it can be used as an attribute to get the column.</p>
<p>For instance: <code>df.gender</code></p>
<p>As it does not always work, we do not recommend this notation. It is explained here because it appears in code found on the internet. This attribute notation is also used in <b>pandas</b> <i>method chaining</i>.</p>
</div>

In [None]:
# column name as an attribute
df.gender

It is possible to modify the index of a `DataFrame` object by using the `set_index()` method with a column of the `DataFrame` object.

In [None]:
# year 2023
df2023

In [None]:
# the name column is used as an index
name2023 = df2023.set_index("name")
name2023

In [None]:
# accessing a column
name2023["births"]

In [None]:
# index of min
name2023["births"].idxmin()

In [None]:
# index of max
name2023["births"].idxmax()

In [None]:
# index of the new DataFrame object
name2023.index

In [None]:
# name of the index of the new DataFrame object
name2023.index.name

In [None]:
# the index of the new DataFrame object is not unique
name2023.index.is_unique

<div class="alert alert-success">
<h3><i class="fa fa-edit"></i> Exercise 11 &starf;</h3>
<ul>
    <li>Why is the index not unique?</li>
    <li>Limit the <code>name2023</code> DataFrame object to the first 17,532 rows and check that the index is unique.</li>
    <li>If it returns <code>False</code> (<b>pandas</b> bug), compare <code>len(set(name2023.index))</code> with <code>len(name2023.index)</code>.</li>
</ul>
</div>

In [None]:
# %load notebook2/ex_11.py

### 2.2.3 Vectorial operations with DataFrame

Usual statistical functions available for `Series` objects are also available for `DataFrame` objects: e.g., `sum()`, `min()`, `max()`, `mean()`, `median()`, `std()`, `var()`, `cumsum()`, `cumprod()`, `cummin()`, `cummax()`, `idxmin()`, `idxmax()`.

- `f(axis=0)` applies a function `f` to each <b>column</b> of a `DataFrame` object and returns a <b>row-like</b> result.
- `f(axis=1)` applies a function `f` to each <b>row</b> of a `DataFrame` object and returns a <b>column-like</b> result.
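The two directions can be sketched on a small numeric `DataFrame`. The table of scores below is hypothetical and only serves to make the totals easy to verify.

```python
import pandas as pd

scores = pd.DataFrame({"math": [10, 14, 18], "physics": [12, 8, 16]},
                      index=["Emma", "Lucas", "Noah"])

per_column = scores.sum(axis=0)   # one value per column, indexed by column label
per_row = scores.sum(axis=1)      # one value per row, indexed by row label

print(per_column)
print(per_row)
```

Both results are `Series` objects: the first is indexed by the column labels, the second by the row labels.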

In [None]:
# max of columns
name2023.max()

### 2.2.4 Creating a DataFrame object manually

In [None]:
# dict of lists

df2 = pd.DataFrame({"name": ["Emma", "Lucas", "Noah", "Olivia"],
                   "age": [12, 25, 27, 32]})
df2

In [None]:
# list of dicts

df2 = pd.DataFrame([{"name": "Emma", "age": 12},
                   {"name": "Lucas", "age": 25},
                   {"name": "Noah", "age": 27},
                   {"name": "Olivia", "age": 32}])
df2

# 3. Accessing data

Observation about the vocabulary:
- Objects of type `Series` and `DataFrame` have an index whose labels may be integers (by default) or, more generally, labels such as strings. When the index is unique, these objects behave like a dictionary which is accessed with a key.
- These objects are also based on `numpy.ndarray` vectors or tables which can be accessed by position, i.e. by an integer (starting at 0). Therefore these objects may also behave like an ordered sequence of 1 or 2 dimensions which is accessed with a position.

The notation used for ordered sequences in Python, and for `ndarray` in **numpy**, has been reused for the index of `Series` and `DataFrame`, with slight differences:
- When 2 labels are indicated, the second label is included in the selection, contrary to standard Python and **numpy** slicing.
- When 2 labels are indicated, each label must identify an unambiguous position: this is always the case when the index is unique, whereas a duplicated label in an unsorted index raises a `KeyError`.
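These two differences can be sketched with two small hypothetical `Series`, one with a unique index and one with a duplicated label in an unsorted index:

```python
import pandas as pd

unique = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

by_label = unique.loc["b":"c"]     # "c" is included: 2 values
by_position = unique.iloc[1:2]     # position 2 is excluded: 1 value
print(len(by_label), len(by_position))

# With a duplicated label in an unsorted index, the slice bound is ambiguous.
duplicated = pd.Series([1, 2, 3], index=["a", "b", "a"])
try:
    duplicated.loc["a":"b"]
    slice_failed = False
except KeyError as err:
    slice_failed = True
    print("label slice failed:", err)
```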

## 3.1 Accessing data from Series

Several indexers give access to data in a `Series`:
- `.loc[]`: selection by label
    - `s.loc[i]`: selection of a single label
    - `s.loc[[i, j, k]]`: selection of several labels (also called *fancy indexing*)
    - `s.loc[i:j]`, `s.loc[i:j:k]`: selection of a slice (warning: `j` is included)
    - `s.loc[mask]`: selection from a boolean mask sharing the same index as the `Series` object, `True` = selected, `False` = not selected
- `.iloc[]`: same, but reserved for positions (&#9888; for slices like `[i:j]`, `j` is excluded)

When the selection corresponds to a single value, it returns a scalar value.

When the selection corresponds to several values, it returns a `Series` object.

To access a single value, one can also use:
- `.at[]`: access to a single value by label,
- `.iat[]`: same, by position.
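The indexers above can be compared side by side on a small hypothetical `Series` with a unique index of names:

```python
import pandas as pd

births = pd.Series([100, 90, 80, 70], index=["Olivia", "Emma", "Mia", "Ava"])

one_label = births.loc["Emma"]                 # scalar: 90
several = births.loc[["Olivia", "Mia"]]        # Series with 2 values
label_slice = births.loc["Emma":"Mia"]         # "Mia" included: 90, 80
position_slice = births.iloc[1:2]              # position 2 excluded: 90
masked = births.loc[births > 85]               # boolean mask: 100, 90
value_by_label = births.at["Ava"]              # scalar access by label: 70
value_by_position = births.iat[0]              # scalar access by position: 100

print(one_label, label_slice.tolist(), position_slice.tolist(), masked.tolist())
```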

In [None]:
# selection of a column
s = name2023["births"]
s

### 3.1.1 Selection in Series by label or position

In [None]:
# selection by label
s.loc["Olivia"]

In [None]:
# selection by position
s.iloc[0]

In [None]:
# selection by a list of labels
s.loc[["Olivia", "Sophia", "Mia"]]

In [None]:
# selection by a list of positions
s.iloc[[0, 4, 7]]

In [None]:
# selection by a slice of index
# Warning label "Mia" is included
s.loc["Olivia":"Mia"]

In [None]:
# selection by a slice of positions
# Warning: position 8 is excluded
s.iloc[0:8]

### 3.1.2 Selection in Series with a boolean mask

It is possible to perform a selection in a `Series` object with a boolean mask, i.e. a boolean `Series` object sharing the same index as the initial `Series`.

The easiest way is to build a boolean `Series` object from the `Series` object itself.

It is possible to combine boolean masks using logical operations between boolean objects of type `Series`:

- AND is `&`
- OR is `|`
- NOT is `~`
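
As an illustration of these three operators, here is a short sketch on a made-up list of names (the `names` Series is hypothetical, not the course dataset). Note that comparisons must be wrapped in parentheses, because `&` and `|` bind more tightly than `==` or `<=` in Python.

```python
import pandas as pd

# toy Series of made-up names
names = pd.Series(["Anna", "Alex", "Bella", "Zoe"])

starts_a = names.str.startswith("A")   # boolean mask
ends_a = names.str.endswith("a")       # boolean mask

both = names.loc[starts_a & ends_a]       # AND
either = names.loc[starts_a | ends_a]     # OR
not_start = names.loc[~starts_a]          # NOT

# parentheses are required around the comparison
short_a = names.loc[(names.str.len() <= 4) & starts_a]
```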

In [None]:
# all names
s = df["name"]
s

In [None]:
# boolean vector: True if the name starts with an "A"
mask = s.str.startswith("A")
mask

In [None]:
# selection from a boolean mask
s2 = s.loc[mask]  # or s[mask]
s2

In [None]:
# names starting with "Nath"
s.loc[s.str.startswith("Nath")]

<div class="alert alert-warning" role="alert">
    <h3><i class="fa fa-question-circle"></i> Online question</h3>
    <ul>
        <li>Try the selection with other name prefixes: e.g., Chris, Fran, Isa, Rob, Sam.</li>
    </ul>
</div>

<div class="alert alert-success">
<h3><i class="fa fa-edit"></i> Exercise 12 &starf;&starf;</h3>
<ul>
<li>Which names start with a "Z"?</li>
<li>Which names end with a "z"?</li>
<li>Which names start with a "Z" or end with a "z"?</li>
<li>Which names start with a "Z" and end with a "z"?</li>
</ul>
</div>

In [None]:
# %load notebook2/ex_12.py

In [None]:
# names that contain 2 "z" separated by at least one character
s.loc[s.str.contains("z.+z")].unique()

<div class="alert alert-info">
    <h3><i class="fa fa-info-circle"></i> Tips</h3>
    <p>Regular expressions, or regex, make it possible to capture patterns in strings. In this kind of expression, some characters have a special meaning when matching:</p>
    <ul>
        <li><code>.</code>: matches any character</li>
        <li><code>*</code>: matches 0 or more repetitions of the preceding expression</li>
        <li><code>+</code>: matches 1 or more repetitions of the preceding expression</li>
        <li><code>?</code>: matches 0 or 1 repetition of the preceding expression</li>
        <li><code>^</code>: matches the start of the string</li>
        <li><code>$</code>: matches the end of the string</li>
        <li>Prefix any of those special characters with a backslash <code>\</code> so that they are considered as standard characters</li>
    </ul>
    <p>The string method <code>contains()</code> treats its argument as a regular expression by default. Use the <code>regex=False</code> option to search for literals.</p>
</div>
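
The difference between a regular expression and a literal search can be seen on a few made-up strings. This is only a sketch: the `words` Series is hypothetical and chosen so that the special character `.` matters.

```python
import pandas as pd

# toy Series of made-up strings
words = pd.Series(["a.b", "axb", "ab", "xa.b"])

# "." is a special character: it matches any single character
any_char = words.str.contains("a.b")

# regex=False searches for the literal text "a.b"
literal = words.str.contains("a.b", regex=False)

# escaping the dot with a backslash gives the same result as the literal search
escaped = words.str.contains(r"a\.b")

# ^ and $ anchor the pattern to the start and the end of the string
anchored = words.str.contains(r"^a.b$")

# "?" makes the preceding character optional
optional = words.str.contains("ax?b")
```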

<div class="alert alert-warning" role="alert">
    <h3><i class="fa fa-question-circle"></i> Online question  &starf; &starf;</h3>
    <P>Try the selection with 2 consecutive z, with 2 z separated by exactly 1 letter, and with 2 z separated by any number of letters (possibly 0).</P>
</div>

<div class="alert alert-warning">
<h3><i class="fa fa-book"></i> Further reading</h3>
<ul>
    <li>Regular expressions are useful when dealing with textual information.</li>
    <li>For more information on regular expressions in Python, see: <a href='https://docs.python.org/3/library/re.html'>re — Regular expression operations</a></li>
</ul>
</div>

## 3.2 Accessing data from DataFrame

The same operators give access to the data in a `DataFrame`.
- `[]`: selection of column(s)
    - `df[c]`: selection of a single column by label
    - `df[[c, d, e]]`: selection of several columns by labels (*fancy indexing*)
- `.loc[row_indexer, column_indexer]`: selection of rows and columns by label; when `column_indexer` is omitted, all columns are returned
    - The `row_indexer` can be:
        - `i`: selection of a single row by index label
        - `[i, j, k]`: selection of several rows by index labels (also called *fancy indexing*)
        - `i:j`: selection of a slice of index labels (warning: `j` is included)
        - `mask`: selection with a boolean mask
    - The `column_indexer` can be:
        - `c`: selection of a single column by label
        - `[c, d, e]`: selection of several columns by labels (also called *fancy indexing*)
        - `c:d`: selection of a slice of columns by labels (warning: `d` is included)
- `.iloc[row_indexer, column_indexer]`: same, but for positions (&#9888; for slices like `[i:j]`, `j` is excluded)

When the selection corresponds to a single value, it returns a scalar value.

When the selection corresponds to part of a column or part of a row, it returns a `Series` object. When it is part of a column, its index is a subset of the index of the initial `DataFrame`. When it is part of a row, its index is a subset of the columns of the initial `DataFrame`.

When the selection corresponds to several rows and several columns, it returns a `DataFrame` object whose index is a subset of the index of the initial `DataFrame` and whose columns are a subset of the columns of the initial `DataFrame`.

To access a single value, one can also use:
- `.at[x, y]`: access to a single value by labels,
- `.iat[x, y]`: the same, by positions.
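
The return types described above can be checked on a small example. This is a minimal sketch: `df_demo`, its labels and its values are made up for illustration and are not the course dataset.

```python
import pandas as pd

# toy DataFrame (made-up values), indexed by name
df_demo = pd.DataFrame(
    {
        "gender": ["F", "M", "F"],
        "births": [120, 95, 60],
        "year": [2020, 2020, 2021],
    },
    index=["Ann", "Bob", "Cid"],
)

value = df_demo.loc["Bob", "births"]                 # single row, single column: scalar
column_part = df_demo.loc["Ann":"Bob", "births"]     # part of a column: Series
row_part = df_demo.loc["Ann", "gender":"births"]     # part of a row: Series
block = df_demo.loc[["Ann", "Cid"], ["births", "year"]]   # rows and columns: DataFrame
same_block = df_demo.iloc[[0, 2], 1:3]               # the same selection, by positions

single_by_label = df_demo.at["Cid", "year"]          # single value by labels
single_by_position = df_demo.iat[2, 2]               # single value by positions
```

The index of `column_part` comes from the rows of `df_demo`, while the index of `row_part` comes from its columns.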

### 3.2.1 Selecting columns in DataFrame

When selecting a single column, we obtain a `Series` object which shares the same index as the `DataFrame`.

When selecting several columns, we obtain a new `DataFrame`: a subset of the original one, sharing the same index and therefore with as many rows.

This technique can also be used to reorder the columns of a `DataFrame`.
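
Reordering columns can be sketched as follows, on a made-up `DataFrame` (`df_cols` is hypothetical): the list of labels passed to `[]` sets the order of the columns in the result.

```python
import pandas as pd

# toy DataFrame (made-up values)
df_cols = pd.DataFrame(
    {"year": [2020, 2021], "name": ["Ann", "Bob"], "births": [120, 95]}
)

# single label: Series sharing the index of the DataFrame
one_column = df_cols["name"]

# list of labels: new DataFrame whose columns follow the order of the list
reordered = df_cols[["name", "births", "year"]]
```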

In [None]:
# selection of columns by labels
df[["name", "gender"]]

**With the `.loc[]` operator**
- the first argument selects rows; to select all rows, use a colon `:`
- the second argument selects columns (optional)

In [None]:
# selection of columns by labels using .loc
# the : alone stands for all rows from start to end
df.loc[:, ["name", "gender"]]

**With the `.iloc[]` operator**
- the first argument selects rows; to select all rows, use a colon `:`
- the second argument selects columns (optional)

In [None]:
# selection of columns by slice of positions using .iloc
# the : alone stands for all rows from start to end
df.iloc[:, 1:3]

### 3.2.2 Selecting rows in DataFrame

When selecting a single row, we obtain a `Series` object whose index corresponds to the columns of the `DataFrame`.

When selecting several rows, we obtain a new `DataFrame`: a subset of the original one, with the same columns.

**With the `.loc[]` operator**

In [None]:
# selection of a row
row = name2023.loc["Emma"]
row

In [None]:
# row is indeed a Series object
type(row)

In [None]:
# test equality of DataFrame columns and row index
(name2023.columns == row.index).all()

In [None]:
# selection of several rows
name2023.loc["Olivia":"Isabella"]

**With the `.iloc[]` operator**

In [None]:
# selection of the last row
row = name2023.iloc[-1]
row

### 3.2.3 Selecting rows and columns in DataFrame

When selecting a single row and several columns, we obtain a `Series` object whose index is a subset of the columns of the `DataFrame` object.

When selecting several rows and several columns, we obtain a new `DataFrame`: a subset of the original one.

**With the `.loc[]` operator**

In [None]:
# selection of a single row and several columns
name2023.loc["Olivia", "gender":"births"]

In [None]:
# selection of several rows and columns
name2023.loc["Olivia":"Isabella", "gender":"births"]

**With the `.iloc[]` operator**

In [None]:
# selection of several rows and columns
df.iloc[[0, -1], 1:3]

**Selection when the index is not unique**

In [None]:
name = df.set_index("name")
name

In [None]:
# selection of a single index
name.loc["Olivia"]

In [None]:
# index is not unique
name.index.is_unique

In [None]:
# selection of a slice of labels
# label slicing is expected to fail (KeyError) because the index is not unique
name.loc["Olivia":"Mia"]

### 3.2.4 Selection in DataFrame with a boolean mask

It is possible to select from a `DataFrame` object with a boolean mask sharing the same index as the `DataFrame`.

The easiest way is to build a boolean `Series` object from one or several columns of the initial `DataFrame` object.

It is possible to combine boolean masks using logical operations between boolean objects of type `Series`:

- AND is `&`
- OR is `|`
- NOT is `~`

In [None]:
# boolean mask: True where the name equals Emma
mask = (df["name"] == "Emma")
mask

In [None]:
# selection with name equals Emma
df.loc[mask]

In [None]:
# names used at least 1000 times in a year
df.loc[df["births"] >= 1000]

In [None]:
# female names used at least 1000 times in a year
df.loc[(df["gender"] == "F") & (df["births"] >= 1000)]

<div class="alert alert-success">
<h3><i class="fa fa-edit"></i> Exercise 13 &starf;&starf;&starf;</h3>
<ul>
    <li>Print years, names and births of all names with only 2 letters.</li>
    <li>How many names do we have?</li>
    <li>Give the list of such names in alphabetical order.</li>
    <li>How many births are there for names with only 2 letters?</li>
    <li>In how many years is the number of births for a name exactly 1000?</li>
    <li>Are there any years for which the number of births for a name is equal to the year?</li>
    <li>Give the list of such names in alphabetical order.</li>
    <li>Tips: for these exercises, find the condition part of the question and implement it as a <code>row_indexer</code>, then find the output part and implement it as a <code>column_indexer</code>, plus possibly an aggregation function.</li>
</ul>
</div>

<div class="alert alert-info">
    <h3><i class="fa fa-info-circle"></i> Tips</h3>
    <ul>
    <li>In a statement of the type <code>df.loc[row_indexer, column_indexer]</code></li>
    <ul>
    <li>Conditions for selecting rows are expressed through the <code>row_indexer</code>: a logical expression</li>
    <li>Output variables are identified through the <code>column_indexer</code>: single label or list of labels</li>
    </ul>
    <li>An aggregation method can be applied to the result: <code>df.loc[row_indexer, column].aggfunc()</code></li>
    <ul>
    <li>Often <code>column_indexer</code> is the name of a single column, here <code>column</code></li>
    <li>The result is a <code>Series</code> object on which an aggregation method is applied, here <code>aggfunc()</code></li>
    </ul>
    </ul>
</div>
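
The pattern described in these tips can be sketched on a small made-up table (`df_agg` and its values are hypothetical and unrelated to the exercise data): the condition goes in the row indexer, the output column in the column indexer, and an aggregation method is applied to the resulting `Series`.

```python
import pandas as pd

# toy DataFrame (made-up values)
df_agg = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Ann", "Cid"],
        "gender": ["F", "M", "F", "M"],
        "births": [120, 95, 80, 60],
    }
)

# condition part: a boolean mask used as row indexer
is_female = df_agg["gender"] == "F"

# output part: a single column, followed by an aggregation method
total_female = df_agg.loc[is_female, "births"].sum()
max_male = df_agg.loc[~is_female, "births"].max()

# number of distinct names with at least 80 births in a row
n_names = df_agg.loc[df_agg["births"] >= 80, "name"].nunique()
```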

In [None]:
# %load notebook2/ex_13.py

## 3.3 Testing data

The SSA web site explains:
- To safeguard privacy, we restrict our list of names to those with at least 5 occurrences.
- Name is 2 to 15 characters

We will check these points.

<div class="alert alert-success">
<h3><i class="fa fa-edit"></i> Exercise 14 &starf;</h3>
    <p>Using logical operators:</p>
<ul>
    <li>Check that all births are at least 5</li>
    <li>Check that all names have between 2 and 15 characters</li>
</ul>
</div>

In [None]:
# %load notebook2/ex_14.py

## 3.4 Extra exercises

<div class="alert alert-success">
<h3><i class="fa fa-edit"></i> Exercise 15 &starf;&starf;</h3>
<ul>
    <li>Select births which are over 5000</li>
    <li>Which years are multiples of 10?</li>
    <li>Which names start with "Y" and end with "y"?</li>
    <li>How many names start with "X" and do not end with "a"?</li>
    <li>Which names contain all vowels in the alphabetical order (a, e, i, o, u)? Tips: use a regular expression.</li>
</ul>
</div>

In [None]:
# %load notebook2/ex_15.py

# Summary

The **pandas** module is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

It introduces 1-dimensional `Series` objects and 2-dimensional `DataFrame` objects to manipulate data.

`Series` objects offer dictionary-like access to data through the index, as well as vector-like access. They may contain booleans, numbers, time objects or strings, and provide operators appropriate to their type of data.

`DataFrame` objects offer dictionary-like access to data through the index and the columns, as well as matrix-like access.

The `.loc[]` operator selects data using labels or boolean masks.

The `.iloc[]` operator selects data using positions.