<img src="https://dauphine.psl.eu/fileadmin/_processed_/9/2/csm_damier_logo_Dauphine_f7b37a1ff2.jpg" width="200" style="vertical-align:middle" /> <h1>Master 222: Introduction to Python - Session 3</h1>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Zaltarba/PSL_python_for_finance/blob/main/python_session_3.ipynb)


# The Pandas library 

In this exercise, you will learn to use the pandas module. Pandas is a Python package specialized in data manipulation.

[For more information on pandas click here](http://pandas.pydata.org/)

To begin, import the `pandas` module under the abbreviated name `pd` by running the preamble cell below. NumPy will also be used, under the name `np`.



In [None]:
# Importing necessary libraries
import pandas as pd
import numpy as np

## Exercise 1

A Series is a one-dimensional labeled array that can hold any data type. The labels are referred to as the index. Its syntax is as follows: `pd.Series(X)`, where `X` is a list or an array.
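As a minimal sketch of this syntax, the example below builds a Series from a plain list. The name `prices` and its values are made up for illustration.

```python
import pandas as pd

# Illustrative values only; any list or NumPy array can be passed.
prices = pd.Series([10.5, 11.0, 9.8])

print(prices)              # values shown next to the default labels 0, 1, 2
print(list(prices.index))  # [0, 1, 2]
```

When no index is given, pandas labels the data with consecutive integers starting at 0.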

1. Create a Series of 5 data points from a list of random numbers uniformly distributed between 0 and 1.

In [None]:
## Insert your code here


It is possible to specify the indices using the following syntax: `pd.Series(X, index = Y)`, where `X` is the data list and `Y` is a list of associated indices.
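For instance, the sketch below attaches string labels to three values. The name `volumes` and the labels are hypothetical, chosen only to illustrate the `index` argument.

```python
import pandas as pd

# Hypothetical labels used as the index; the data are made up.
volumes = pd.Series([100, 250, 75], index=["AAA", "BBB", "CCC"])

print(volumes)         # each value is displayed next to its label
print(volumes["BBB"])  # 250
```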

2. Create a Series of 4 data points, each with a value of 1, and specify the following list of indices: ['a', 'b', 'c', 'd'].

In [None]:
## Insert your code here


By using indices, you can access the data in the Series in the same way you access elements in a list.  
Slicing is also possible.
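The sketch below illustrates both kinds of access on a small, made-up Series named `s`. It also uses `.iloc` for access by position, which is an addition to the text above. Note one difference from lists: a slice written with labels includes its end label.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])

first = s["x"]            # access by label
first_by_pos = s.iloc[0]  # access by position
label_slice = s["x":"y"]  # label slice: the end label "y" is included
pos_slice = s.iloc[0:2]   # positional slice: position 2 is excluded

print(first, first_by_pos)
print(label_slice)
print(pos_slice)
```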

3. Create a variable named `series_one` and assign it a Series created from a list of 4 random numbers uniformly distributed between 0 and 1.
4. Use the list of indices from the previous question: `['a', 'b', 'c', 'd']`.
5. Retrieve the first element of `series_one` using the corresponding index.

In [None]:
## Insert your code here


6. Change the fourth data point of `series_one` to 0.
7. Display `series_one`.

In [None]:
## Insert your code here


## Exercise 2

So far, pandas has shown `dtype: float64` each time a Series was displayed. This is the data type of the values: floating-point numbers encoded on 64 bits. You can specify the data type you want when creating a Series, using the `dtype` parameter.

Furthermore, you can name the Series using the `name` parameter.
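As a small illustration of both parameters, the sketch below stores made-up values as 32-bit floats and gives the Series a name. The variable `rates` and its name are hypothetical.

```python
import pandas as pd

# Made-up values stored as 32-bit floats instead of the default 64-bit.
rates = pd.Series([1.5, 2.5], dtype="float32", name="rates")

print(rates)        # the last line shows "Name: rates, dtype: float32"
print(rates.dtype)  # float32
print(rates.name)   # rates
```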

1. Create a variable `series_two` from an array of four ones.
2. Specify the data type `dtype` as `int`.
3. Name this Series `my_series`.
4. Display the Series.

In [None]:
## Insert your code here


The `describe()` method returns summary statistics about the Series it is applied to. For numeric data, these are the count, mean, standard deviation, minimum, quartiles, and maximum.
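A minimal sketch on made-up data: the result of `describe()` is itself a Series, so each statistic can be read by its label. The name `scores` is hypothetical.

```python
import pandas as pd

scores = pd.Series([1, 2, 3, 4])
summary = scores.describe()

print(summary)          # count, mean, std, min, 25%, 50%, 75%, max
print(summary["mean"])  # 2.5
```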

5. Create a variable `series_three` from an array of 20 random numbers uniformly distributed between 0 and 1.
6. Display information about the Series using `describe()`.


In [None]:
## Insert your code here


It's possible to add Series together. Pandas will sum the data with matching *indices*. If an *index* is missing in one of the Series, the resulting sum Series will display `NaN` (Not a Number) at that index.
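The behavior described above can be sketched with two small Series whose labels only partly overlap. The names and values are made up for illustration.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=["x", "y", "z"])
right = pd.Series([10, 20], index=["x", "y"])

total = left + right  # values are matched by label, not by position
print(total)          # "x" and "y" are summed; "z" has no match and becomes NaN
```

Because `NaN` is a floating-point value, the result is stored as floats even though both inputs contain integers.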

7. Create a Series `series_four` from an array of 19 random numbers uniformly distributed between 0 and 1.
8. Sum `series_three` and `series_four` and look at the result.

In [None]:
## Insert your code here


However, you can specify a value to substitute for the missing data when an *index* is present in only one of the two Series. The following syntax is used:
```python
## Assume a and b are two Series
a.add(b, fill_value = 0)  ## a value missing from one Series is treated as 0
```

9. Sum `series_three` and `series_four` by specifying `fill_value` equal to 100.

In [None]:
## Insert your code here


Lastly, it's possible to use comparison and arithmetic operators on Series. The following syntax is used:
```python
# Assume a is a Series
a[a >= 0.5]  ## returns the data from a greater than or equal to 0.5
a * 2  ## multiplies the data from a by two
```
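To see what happens step by step, here is a runnable sketch on made-up values. The comparison produces a Series of booleans (called `mask` here, a hypothetical name), and indexing with it keeps only the rows where the mask is `True`.

```python
import pandas as pd

a = pd.Series([0.2, 0.7, 0.5])

mask = a >= 0.5  # boolean Series: False, True, True
kept = a[mask]   # keeps the rows where mask is True, with their original labels
doubled = a * 2  # element-wise multiplication

print(mask)
print(kept)
print(doubled)
```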

10. Create a variable `a`, and assign it a Series of integer numbers uniformly distributed between 1 and 20, with a size of 20.
11. Display the Series with data strictly greater than 10.

In [None]:
## Insert your code here


## Exercise 3

1. Create an *index* of size 20 that contains "boy" or "girl", randomly distributed.
    - use a list comprehension
2. Create an array of size 20 that contains ages ranging from 3 to 16 years, randomly distributed.
3. Create a Series `cousins` with `name = "my cousins"`, the index created previously, and data from the array.
4. Using the index, create a Series `boys` and a Series `girls` by filtering the Series `cousins`.
    - use `cousins.index`
5. Display information about these two Series.

In [None]:
## Insert your code here


In [None]:
## Insert your code here


## Exercise 4

Now we turn our attention to DataFrames. DataFrames are the two-dimensional extension of Series: each column is a Series, and all the columns of a DataFrame share the same *index*.

A common way to create a DataFrame is by using a dictionary. The syntax is as follows:
```python
pd.DataFrame({'Name of the first column': data_1, 'Name of the second column': data_2})
```
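As a concrete sketch of this syntax, the example below builds a small DataFrame from a dictionary of lists. The name `df_example`, the column names, and the values are all made up for illustration.

```python
import pandas as pd

# Each dictionary key becomes a column name; each list becomes that column's data.
df_example = pd.DataFrame({
    "Ticker": ["AAA", "BBB", "CCC"],
    "Price": [10.5, 11.0, 9.8],
})

print(df_example)
print(df_example.shape)  # (3, 2): three rows, two columns
```

All lists must have the same length, since the columns share one index.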
1. Create a DataFrame *df* with two columns, 'Gender' and 'Age', using the genders and ages from the previous exercise.
2. Display the DataFrame.

In [None]:
## Insert your code here


3. Create a list *dominant_hand* of size 20 that contains "left-handed" or "right-handed", distributed randomly.
    - Use a list comprehension
4. Add this list as a new column to *df*.
    - Use `df['Dominant_hand'] = dominant_hand`
    - Avoid spaces in column names; use `_` instead


In [None]:
## Insert your code here


*Slicing* is possible with DataFrames.
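A brief sketch on a made-up DataFrame (the name `table` and its contents are hypothetical): slicing with integers selects rows by position, and `.iloc` and `.head()` give the same rows here.

```python
import pandas as pd

table = pd.DataFrame({
    "Ticker": ["AAA", "BBB", "CCC", "DDD"],
    "Price": [10.5, 11.0, 9.8, 12.1],
})

first_two = table[0:2]       # row slice by position; position 2 is excluded
same_rows = table.iloc[0:2]  # explicit positional slicing
head_rows = table.head(2)    # first two rows via the head() method

print(first_two)
```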

5. Display the first 5 rows of *df*.
    - Using slicing
    - Using the `.head()` method

In [None]:
## Insert your code here


6. Display the columns "Gender" and "Dominant_hand".


In [None]:
## Insert your code here


## Exercise 5

It is possible to concatenate two DataFrames using the command `pd.concat()`. The syntax is as follows:
```python
# Assume X and Y are two DataFrames
pd.concat([X,Y], axis = 0)  ## concatenates vertically
pd.concat([X,Y], axis = 1)  ## concatenates horizontally
```
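To make the two directions concrete, here is a runnable sketch on tiny made-up DataFrames. The names `top`, `bottom`, and `side` are hypothetical.

```python
import pandas as pd

top = pd.DataFrame({"A": [1, 2]})
bottom = pd.DataFrame({"A": [3, 4]})
side = pd.DataFrame({"B": [5, 6]})

stacked = pd.concat([top, bottom], axis=0)  # rows of bottom are placed under top
wide = pd.concat([top, side], axis=1)       # columns of side are placed next to top

print(stacked)  # 4 rows, 1 column; the original row labels 0, 1, 0, 1 are kept
print(wide)     # 2 rows, 2 columns
```

With `axis = 1`, rows are aligned on the index, so both DataFrames should share the same row labels.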
1. Create a list of size 20 that contains "red", "blue", or "green", distributed randomly.
    - use a list comprehension
    - use a dictionary
    - use `np.random.randint`
2. Create a DataFrame `df_colors` from this list.
3. Add a name to the column using the command `df_colors.columns = ['Column_Name']`.
4. Concatenate `df` and `df_colors` into `df`.


In [None]:
## Insert your code here


## To Go Further

Use the pandas documentation or Stack Overflow to find the answers to the following exercises.

### Basic DataFrame Operations:

1. Create a DataFrame from a dictionary with keys: 'Name', 'Age', 'City' and populate it with some data.
2. Display the first 5 rows of the DataFrame.
3. Display the last 3 rows of the DataFrame.
4. Display the data types of each column.

### Indexing and Selection:

1. Select the 'Name' and 'City' columns from the DataFrame.
2. Select the row at index 2 from the DataFrame.
3. Select the rows where 'Age' is greater than 25. Display the first 5.

### Sorting and Ranking:

1. Sort the DataFrame based on 'Age' in descending order.
2. Create a column 'Age_rank', with the oldest as rank 1.

### Missing Data:

1. Introduce some missing values in the DataFrame using np.nan.
    - Create a DataFrame using a dictionary comprehension
    - Do not add missing values to the column 'Name'
2. Display the number of missing values for each column.
3. Fill the missing values with the mean of the non-missing values.

### Grouping and Aggregation:

1. Group the DataFrame by 'City' and calculate the mean age for each city.
2. Find the maximum and minimum age for each city.
3. Create a column 'Age_rank_by_city' that ranks ages within each city.

### Merging, Joining, and Concatenating:

1. Create a second DataFrame with keys: 'Name', 'Job Title'.
2. Merge the two DataFrames.
    - Use df.merge

In [None]:
## Insert your code here
