<a href="https://colab.research.google.com/github/cbedart/CBPPS/blob/2024/CBPPS_part6_numpy_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**<h1><center>Part 6 - NumPy and Pandas</center></h1>**

---

# **➤ NumPy**

- The NumPy module is a must-have for performing element-by-element calculations on vectors or matrices, via a new object type called an array (`numpy.ndarray`).
- Unlike the modules seen last week, NumPy is not supplied with the basic Python distribution. You can install it with one of the commands:

```
pip install numpy
OR
conda install -c conda-forge numpy
```
- To load the NumPy module, add `import numpy` to your code
- By convention, we use `np` as a short name for NumPy = `import numpy as np`

- The official NumPy user guide is great = https://numpy.org/doc/stable/user/
- A good cheat sheet = https://assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf

In [None]:
import numpy as np

dir(np)

### **Array objects**

- Correspond to one- or multi-dimensional arrays
- Can be used to perform vector calculations
- Function `array()` to convert a "container" (a list or a tuple, as an example) into an array object
- Python will use the word array and `([`  `])` symbols to distinguish it from a list `[]` or a tuple `()` (... but print will format the array as a list of elements)
- Indices, slices, and loops work the same way as with lists and tuples

In [None]:
import numpy as np

list1 = ["hazelnuts", "apples", "oranges", "pears"]

array1 = np.array(list1)
array1

In [None]:
array1[1]

In [None]:
np.array(range(5))

- An array object contains only homogeneous data = All elements share one single type
- It is possible to create an array object from a list containing integers and strings, but in this case, all values will be understood by NumPy as strings:

In [None]:
list2 = ["ruler", 14.368, "disk", 12]

np.array(list2)

- Similarly, it is possible to create an array object from a list of integers and floats, but all values will then be understood by NumPy as only integers (if no float in the elements) or only floats (if at least one float in the elements)

In [None]:
list3 = [12, 54, 29]

np.array(list3)

In [None]:
list4 = [12.5, 54, 29]

np.array(list4)
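
To check which type NumPy picked, you can look at the `dtype` attribute of each array. This is a minimal sketch; the variable names `mixed`, `ints`, and `floats` are only illustrative:

```python
import numpy as np

mixed = np.array(["ruler", 14.368, "disk", 12])   # strings + numbers
ints = np.array([12, 54, 29])                     # integers only
floats = np.array([12.5, 54, 29])                 # at least one float

# dtype.kind is a one-letter code: "U" = text, "i" = integer, "f" = float
print(mixed.dtype.kind, ints.dtype.kind, floats.dtype.kind)

# In the mixed array, every number was converted to text
print(mixed[3], type(mixed[3]))
```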

### **Uses and functions identical to Python base (sometimes even improved)**

- `np.arange()` creates a 1-dimensional array, like `range()`
- It can generate array objects with not only integers, **but floats too**!
- `np.concatenate((arrayX, arrayY), axis=0)` must be used if you want to concatenate arrays, since the `+` symbol now performs an element-by-element addition
    - If arrayX and arrayY are 1D, the tuple of arrays is the only argument you need
    - If arrayX and arrayY are 2D, add the `axis` argument to tell NumPy whether to concatenate along axis 0 (rows - default) or axis 1 (columns), provided the dimensions are compatible

In [None]:
np.arange(10)

In [None]:
np.arange(0, 10, 2)

In [None]:
np.arange(10.0)

In [None]:
np.arange(0, 10, 0.2)

In [None]:
array1 = np.array(['hazelnuts', 'apples', 'oranges', 'pears'])
array2 = np.array([1,2,3,4,5])
array3 = np.array([[1, 2], [3, 4], [5, 6]])
array4 = np.array([[1, 2, 2], [3, 4, 5], [5, 6, 0]])

In [None]:
np.concatenate((array1, array2))

In [None]:
np.concatenate((array2, array1))

In [None]:
np.concatenate((array3, array1))

In [None]:
np.concatenate((array3, array4), axis = 0)

In [None]:
np.concatenate((array3, array4), axis = 1)

In [None]:
np.concatenate((array3, array3), axis = 0)

In [None]:
np.concatenate((array3, array3), axis = 1)
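
Whether a concatenation works depends on the shapes along the axis that is *not* being joined. The sketch below uses two made-up arrays, `tall` (3 rows, 2 columns) and `wide` (3 rows, 3 columns), and catches the `ValueError` that NumPy raises when the shapes do not match:

```python
import numpy as np

tall = np.array([[1, 2], [3, 4], [5, 6]])            # shape (3, 2)
wide = np.array([[1, 2, 2], [3, 4, 5], [5, 6, 0]])   # shape (3, 3)

# Same number of rows, so joining along the columns (axis=1) works
side_by_side = np.concatenate((tall, wide), axis=1)
print(side_by_side.shape)

# Different number of columns, so stacking the rows (axis=0) fails
try:
    np.concatenate((tall, wide), axis=0)
    stacked_ok = True
except ValueError as error:
    stacked_ok = False
    print("Cannot concatenate:", error)
```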

/!\ An array is considered as a **vector** /!\\

- Element-by-element vector operations can be performed on this type of object
- Operations are not applied to the container as a whole, as with a list or a tuple, but to each element of the array individually.
- Very useful when analyzing large quantities of data
- Since every single operation is performed on each element, you can easily apply conditions to get a new array of `True` or `False` values, one for each element

In [None]:
array2 = np.array([1, 2, 3, 4, 5])
array2

In [None]:
array2 + 1

In [None]:
array2 * 5

In [None]:
array2 * array2

In [None]:
array2 > 2

In [None]:
array3 = np.array([[1, 2], [3, 4], [5, 6]])
array3 > 2
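
For comparison, here is the same calculation written with a plain list and with an array. This is only an illustrative sketch, and the names `values`, `repeated`, `doubled_list`, and `doubled_array` are made up for the example:

```python
import numpy as np

values = [1, 2, 3, 4, 5]

# With a list, * repeats the whole list, so a loop is needed to double each value
repeated = values * 2
doubled_list = [v * 2 for v in values]

# With an array, the operation is applied to every element at once
arr = np.array(values)
doubled_array = arr * 2

# A condition returns one boolean per element
mask = arr > 2

print(repeated)
print(doubled_list)
print(doubled_array)
print(mask)
```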

- It is also possible to build 2-dimensional array objects by passing a list of lists as arguments to the array() function
- Same in 3D, with a list of lists of lists
- etc.
- In 2D, organized as tables with rows (1st index) and then columns (2nd index)

In [None]:
array3 = np.array([[1, 2], [3, 4], [5, 6]])
array3

In [None]:
array3[1][0]

In [None]:
array4 = np.array([[[1, 2], [2, 3]], [[4, 5], [5, 6]]])
array4

### **Some useful attributes:**

- `arrayX.ndim` returns the number of array dimensions
- `arrayX.shape` returns the dimensions as a tuple. In the case of a matrix (2D array), the first value of the tuple corresponds to the number of rows, and the second to the number of columns
- `arrayX.size` returns the total number of elements contained in the array



In [None]:
array4.ndim

In [None]:
array4.shape

In [None]:
array4.size


### **Some useful functions:**

- `np.zeros((X,Y))` to generate a 2D array (a matrix) of X rows and Y columns with only 0
- `np.ones((X,Y))` same with only 1
- `np.full((X,Y), value, type)` same with only the chosen value, stored as the selected type

In [None]:
np.zeros((2, 3))

In [None]:
np.ones((3, 3))

In [None]:
np.full((2, 3), 7, int)

In [None]:
np.full((2, 3), 7, float)

- `np.transpose(arrayX)` to transpose the matrix, the same as `arrayX.T`
- `np.dot(arrayX, arrayY)` to get matrix product between two matrices (since `arrayX * arrayY` returns the product element by element.)
- `np.diag(arrayX)` with a 2D array to extract the diagonal of the matrix, or `np.diag([X,Y,Z])` with a list/tuple to get a diagonal matrix with the chosen numeric values
- etc.

In [None]:
array4 = np.array([[1, 2, 2], [3, 4, 5], [5, 6, 0]])

np.transpose(array4)

In [None]:
np.dot(array4, array4)

In [None]:
array4 * array4

In [None]:
np.diag(array4)

In [None]:
np.diag([1,2,3])
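
The difference between the element-by-element product and the matrix product, and the two behaviours of `np.diag()`, can be checked on a small matrix. The `@` operator used below is Python's matrix-multiplication operator, which gives the same result as `np.dot()` for 2D arrays; the name `m` is just an example:

```python
import numpy as np

m = np.array([[1, 2], [3, 4]])

elementwise = m * m        # [[1*1, 2*2], [3*3, 4*4]]
matrix_product = m @ m     # same result as np.dot(m, m) for 2D arrays

diagonal = np.diag(m)          # 2D input: extracts the diagonal as a 1D array
rebuilt = np.diag(diagonal)    # 1D input: builds a diagonal matrix

print(elementwise)
print(matrix_product)
print(diagonal)
print(rebuilt)
```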

- `np.loadtxt()` to load a numeric data file already organized as rows and columns
- `np.savetxt()` to save a numeric data file organized as rows and columns
- BUT:
    - Each row must have the same number of columns = the function does not handle missing data
    - By default, each data item is converted to a float, so a string that cannot be converted raises an error
    - By default, data must be separated by any combination of space(s) and/or tabs
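
As a minimal sketch of a save/load round trip, the example below writes a small matrix to a temporary folder and reads it back. The file names `matrix.txt` and `matrix.csv` are hypothetical, and the `delimiter` argument is used to show how a separator other than spaces can be given:

```python
import os
import tempfile

import numpy as np

data = np.array([[1.5, 2.0, 3.25], [4.0, 5.5, 6.0]])

# Write to a temporary folder so the example leaves no file behind
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "matrix.txt")    # hypothetical file name

    np.savetxt(path, data)       # values separated by spaces by default
    loaded = np.loadtxt(path)    # every value is read back as a float

    # A different separator must be given to both functions
    csv_path = os.path.join(folder, "matrix.csv")
    np.savetxt(csv_path, data, delimiter=",")
    loaded_csv = np.loadtxt(csv_path, delimiter=",")

print(loaded)
print(loaded.dtype)
```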

### **Some useful methods:**

- `arrayX.reshape((X,Y))` returns a new array with the dimensions specified in the argument, **BUT** will not modify arrayX in place
    - (4,2) is totally different from (2,4)
    - (X,Y) is a tuple containing the new dimensions, and **must** be compatible with the initial dimensions of the array (same total number of elements), otherwise an error is raised
    - That **will NOT change** the value of the `arrayX` variable
- `arrayX.resize((X,Y))`, on the other hand, accepts a different total number of elements. It may still raise an error if other references to the array exist, which is frequent in a notebook; the `refcheck=False` argument disables this check, and:
    - If there are fewer elements = Cut the array after the last selected value
    - If there are more elements = Fill with 0
    - That **will change** the value of the `arrayX` variable
- `np.resize(arrayX, (X,Y))` will, in the case of a new array larger than the initial one, repeat the initial array in order to fill in the missing cells:
    - That **will NOT change** the value of the `arrayX` variable

In [None]:
array4 = np.array([[[1, 2], [2, 3]], [[4, 5], [5, 6]]])

In [None]:
array4.reshape((4,2))

In [None]:
array4.reshape((2,4))

In [None]:
array4.reshape((2,2))
array4

In [None]:
array4.resize((3,3))
array4

In [None]:
array4.resize((3,3), refcheck=False)
array4

In [None]:
np.resize(array4, (2,2))
array4
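
The three behaviours can be compared side by side on a small 1D array. This is a sketch with made-up variable names (`original`, `reshaped`, `repeated`, `in_place`), not the only way to use these functions:

```python
import numpy as np

original = np.arange(6)    # [0 1 2 3 4 5]

# reshape: returns an array with the new shape, `original` keeps its own shape
reshaped = original.reshape((2, 3))

# np.resize: returns a new array, repeating the data to fill the extra cells
repeated = np.resize(original, (2, 4))

# ndarray.resize: changes the array in place, filling the extra cells with 0
in_place = np.arange(6)
in_place.resize((2, 4), refcheck=False)

print(original.shape)
print(reshaped)
print(repeated)
print(in_place)
```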

### **Indices**

- To retrieve one or more elements from an array object, you can use indices, in the same way as with lists
- Slices and steps can also be used
- The syntax `a[i, j]` retrieves the element in row i and column j in a single step. You can combine it with slices to get one full row/column, as `a[i, :]`
- You can also use a boolean array, generated with a condition for example, to easily filter your array based on this condition

In [None]:
array3 = np.array([[1, 2], [3, 4], [5, 6]])
array3

In [None]:
array3[0:2]

In [None]:
array3[0,1]

In [None]:
array3[:,1]

In [None]:
array3 > 3

In [None]:
array3[array3 > 3]
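
Slices, steps, and boolean filters can be combined on a slightly larger table. The sketch below uses a made-up 3 x 4 array named `grid`; the last part also shows that a boolean filter can be used to modify the matching values, here on a copy so that `grid` stays unchanged:

```python
import numpy as np

grid = np.arange(12).reshape((3, 4))   # rows 0..2, columns 0..3

first_row = grid[0, :]        # whole first row
last_column = grid[:, -1]     # whole last column
every_other = grid[:, ::2]    # all rows, every second column

# A boolean filter can also be used to replace the matching values
capped = grid.copy()
capped[capped > 8] = 8

print(first_row)
print(last_column)
print(every_other)
print(capped)
```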

<br />

---   

# **➤ Pandas**

- The Pandas module has been designed for data analysis
- Particularly powerful for manipulating structured data in tabular form
- One of the modules that you will probably use the most in your code
- You can install it with one of the commands:
```
pip install pandas
OR
conda install -c conda-forge pandas
```
- To load the Pandas module, add `import pandas` to your code
- By convention, we use `pd` as a short name for Pandas = `import pandas as pd`

<br />

- The **HUGE** difference with NumPy is that Pandas not only uses indices, but also assigns a key to each element in each dimension (= row names, column names, etc.)
    - Very similar to the dictionaries within a dictionary seen before
    - Data manipulation is therefore much easier using just the names of the assigned rows and columns
    - If no labels are assigned, the keys will automatically be the same as the indices
- Moreover, Pandas is perfectly integrated with Jupyter Notebooks, so Google Colab too, to provide easy-to-use representations

<br />

- The Pandas user guide is incredible = https://pandas.pydata.org/docs/user_guide/index.html
- A good cheat sheet from the official website = https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf


In [None]:
import pandas as pd
dir(pd)

### **Series**

- Correspond to a 1-dimensional labeled array (X rows but only 1 column)
- Created with `pd.Series(list/tuple/np.array/etc.)`
- You can attribute custom keys with the `index` argument (optional)
- Everything works like NumPy's 1D arrays

In [None]:
pd.Series([1,2,3])

In [None]:
series1 = pd.Series([1,2,3], index = ["Laptop", "Hazelnut", "Robert"])
series1

In [None]:
series1 * 2

- With Pandas, each element in the data series has a label that can be used to call the elements
- You can also use `.loc[]`, a "strange method" written with `[]` instead of `()`, since we want to locate elements
- To call the first element in the series, you can use its key (here, “Laptop”):


In [None]:
series1["Laptop"]

In [None]:
series1.loc["Laptop"]

- But you can also use the classical index location with `.iloc[X]`, another "strange method" written with `[]` instead of `()`, since we again want to locate elements

In [None]:
series1.iloc[1]

- You can extract multiple elements by giving a list of keys or a list of indices

In [None]:
series1[["Laptop", "Hazelnut"]]

In [None]:
series1.iloc[[0,1]]

- Assigning values to a Series works like with dictionaries: a new key adds a new element, an existing key replaces its value, etc.


In [None]:
series1

In [None]:
series1["Glasses"] = 98
series1


/!\ A Series is ALSO considered as a vector, like NumPy arrays. Reminder: /!\

- Element-by-element vector operations can be performed on this type of object
- Operations are not applied to the container as a whole, as with a list or a tuple, but to each element of the Series individually.
- Since every single operation is performed on each element, you can easily apply conditions to get a new Series of True or False values, one for each element

In [None]:
series1[series1 > 5]

### **Dataframe**

- Pandas main event (and the closest to R tables)
- Two-dimensional tables with row and column labels as keys
- Created with `pd.DataFrame(list of lists/dictionary of dictionaries/2D NumPy array/etc.)`
- You can use the `columns` argument to give column names, and `index` to give row names - Their lengths must match the dimensions of the data
- If you need to change the column or index keys later, you can assign new values to the `dataframeX.columns` or `dataframeX.index` attributes
- Everything works like NumPy's 2D arrays

In [None]:
pd.DataFrame([[1,2,3],[1,2,3]])

In [None]:
pd.DataFrame([[1,2,3],[1,2,3]], columns=["a","b","c"])

In [None]:
pd.DataFrame([[1,2,3],[1,2,3]], columns=["a","b","c"], index=["x","y"])

In [None]:
pd.DataFrame({"x":{"a":1, "b":2, "c":3}, "y":{"a":1, "b":2, "c":3}})

In [None]:
df1 = pd.DataFrame({"x":[1,2,30], "y":[4,50,6]}, index=["a","b","c"])
df1

In [None]:
df1.columns = ["X", "Y"]
df1

In [None]:
df1.index = ["Robert", "Hazelnuts", "Laptop"]
df1

In [None]:
df1["Z"] = [99, 1586, 10000]
df1

In [None]:
df1.loc["Ruler"] = [12, 12, 12]
df1

### **Some properties**

- `dataframeX.columns` and `dataframeX.index` to get the columns and index keys (seen just before)
- `dataframeX.shape` to get the dimensions of the DataFrame
- `dataframeX.head(n)` to only get the first n rows of the DataFrame
- `dataframeX.info()` to get more information about your DataFrame, such as counts, variable types, etc.
- `dataframeX.isna()` to get a DataFrame of booleans showing where values are missing

In [None]:
df2 = pd.DataFrame({"Species":["Iris","Iris","Poppy","Poppy","Iris"], "Length":[5,7,10,15,9], "Width":[10,12,5,6,17]})
df2

In [None]:
df2.shape

In [None]:
df2.head(2)

In [None]:
df2.info()

In [None]:
df2.isna()
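
Since `df2` has no missing value, `isna()` only returns `False` there. The sketch below uses a hypothetical DataFrame named `flowers` with two missing values, and combines `isna()` with `.sum()` and `.any()` to summarize the result:

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with two missing values (np.nan)
flowers = pd.DataFrame({
    "Species": ["Iris", "Poppy", "Iris"],
    "Length": [5.0, np.nan, 9.0],
    "Width": [10.0, 5.0, np.nan],
})

missing_per_column = flowers.isna().sum()   # each True counts as 1
any_missing = flowers.isna().any().any()    # True if at least one value is missing

print(flowers.isna())
print(missing_per_column)
print(any_missing)
```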

### **Mathematical/Statistical operators**

- `dataframeX.describe()` to quickly get a LOT of descriptive statistics for each numeric column, also usable in combination with `[]` or `.loc[]`
- `dataframeX["X"].value_counts()` to count how many times each value appears in a column (on a whole DataFrame, `dataframeX.value_counts()` counts each distinct row instead)
- `dataframeX["X"].mean()` or `dataframeX.loc["X"].mean()` to get the mean of a column or a row (or multiple if you give lists)
    - Same using `.min()`, `.max()`, `.std()`, `.median()`, `.sum()`, etc.
    - Same using `.unique()` to only get the unique values

In [None]:
dir(df2["Width"])

In [None]:
df2.describe()

In [None]:
df2["Width"].describe()

In [None]:
df2["Species"].value_counts()
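
The operators applied to a single column, a single row, or the whole table can be sketched as follows. The DataFrame `plants` and the Series `species` are made-up examples:

```python
import pandas as pd

plants = pd.DataFrame(
    {"Length": [5, 7, 10], "Width": [10, 12, 5]},
    index=["a", "b", "c"],
)

column_mean = plants["Length"].mean()   # mean of one column
row_mean = plants.loc["a"].mean()       # mean of one row
all_means = plants.mean()               # one mean per column

species = pd.Series(["Iris", "Iris", "Poppy"])
unique_species = species.unique()       # each value only once

print(column_mean)
print(row_mean)
print(all_means)
print(unique_species)
```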

- `df2.groupby("X")` groups rows based on the values of the column with the key "X", which become the new row keys. It must be followed by another operator to indicate how the data will be merged:
    - `df2.groupby("X").sum()` to compute the sum
    - `df2.groupby("X").min()` to keep the minimum value
    - `df2.groupby("X").max()` to keep the maximum value
    - `df2.groupby("X").mean()` to compute the mean
    - etc.

In [None]:
df2.groupby("Species")

In [None]:
df2.groupby("Species").mean()
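
Several operators can also be computed at once for one column with `.agg()`, which takes a list of operator names. This is a sketch on a copy of the flower data, named `flowers` here so that it can run on its own:

```python
import pandas as pd

flowers = pd.DataFrame({
    "Species": ["Iris", "Iris", "Poppy", "Poppy", "Iris"],
    "Length": [5, 7, 10, 15, 9],
    "Width": [10, 12, 5, 6, 17],
})

# One operator applied to every remaining column
means = flowers.groupby("Species").mean()

# One column, several operators at once
length_stats = flowers.groupby("Species")["Length"].agg(["min", "max", "mean"])

print(means)
print(length_stats)
```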

### **Select, filter, manipulate data from a DataFrame**


The selection mechanisms featured in pandas are very powerful:
- **Columns selection:**
    - You can select a column using its key => Series object
    - You can select multiple columns using a list of keys, in order of provided keys => DataFrame object

In [None]:
df1["X"]

In [None]:
df1[["Y","X"]]

- **Rows selection**:
    - Since `[]` selects columns by default, you must use `.loc[]` instead of only `[]` to get rows, with either a key or a list of keys

In [None]:
df1.loc["Robert"]

In [None]:
df1.loc[["Laptop", "Robert"]]

- **Columns AND rows selection**:
    - You can also give two arguments to `.loc[]` to get rows AND columns, as `.loc[x,y]`
    - To get the same result as `[]`, you must select every single row with the `:` symbol, as `.loc[:,y]`
    - And again, you can use lists of keys

In [None]:
df1.loc["Robert","X"]

In [None]:
df1.loc[:,"X"]

In [None]:
df1.loc[["Laptop", "Robert"], ["Y","X"]]

- **Condition selection**:
    - As with NumPy arrays and Pandas Series, you can use conditions (or lists/dataframes of booleans) in place of keys
    - You can use conditions with `[]` or `.loc[]`
    - You can also combine several conditions with `&` for the `and` operator and `|` for the `or` operator, with `()` around each condition

In [None]:
df1["X"]>10

In [None]:
df1[df1["X"]>10]

In [None]:
df1[df1["X"]>10]["Y"]

In [None]:
df1[(df1["X"]>10) & (df1["Y"]< 10)]

In [None]:
df1[(df1["X"]>10) & (df1["Y"]< 5)]
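
The `|` operator and the use of a condition inside `.loc[]` can be sketched in the same way. The DataFrame `items` below is a stand-alone copy of the X and Y columns used above; the `~` operator, which inverts a condition ("not"), is an addition to the operators listed in this section:

```python
import pandas as pd

items = pd.DataFrame(
    {"X": [1, 2, 30, 12], "Y": [4, 50, 6, 12]},
    index=["Robert", "Hazelnuts", "Laptop", "Ruler"],
)

# "or": rows where X is above 10 OR Y is above 40
either = items[(items["X"] > 10) | (items["Y"] > 40)]

# .loc takes the condition for the rows and a key for the columns
y_of_big_x = items.loc[items["X"] > 10, "Y"]

# ~ inverts a condition ("not")
small_x = items[~(items["X"] > 10)]

print(either)
print(y_of_big_x)
print(small_x)
```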

### **Combination of dataframes**

- `pd.concat([df1, df2])` will combine a list of dataframes
- Missing values are filled with `NaN` (= Not a Number)
- By default, the dataframes are pasted one below the other
- The `axis` argument is used to add the dataframes as new rows (`axis=0` - default) or side by side as new columns (`axis=1`), matching on the keys of the other axis
- The `join` argument is used to specify whether to keep only the keys common to both dataframes (`join="inner"`) or keep everything (`join="outer"` - default)

In [None]:
data1 = {"Lyon": [10, 23, 17], "Paris": [3, 15, 20]}
data1_df = pd.DataFrame.from_dict(data1)
data1_df.index = ["Shop", "Restaurant", "Bakery"]

data2 = {"Nantes": [3, 9, 14], "Strasbourg": [5, 10, 8]}
data2_df = pd.DataFrame.from_dict(data2)
data2_df.index = ["Shop", "Pharmacy", "Fastfood"]

In [None]:
pd.concat([data1_df, data2_df])

In [None]:
pd.concat([data1_df, data2_df], axis = 0)

In [None]:
pd.concat([data1_df, data2_df], axis = 1)

In [None]:
pd.concat([data1_df, data2_df], axis = 1, join = "inner")

### **Files manipulation**

Pandas includes a wide range of functions for reading and writing files of different types
- `pd.read_csv("file.csv")` (or `pd.read_table("file.csv")`) & `dataframeX.to_csv("file.csv")`
    - The most important one, for X-separated values files such as comma-separated (csv) or tab-separated (tsv) values files, or just plain text files (txt) separated with spaces
    - The `sep` argument is used to give the separator string, as `sep=","`,  `sep="\t"`, or `sep=" "`
    - When reading, the `header` argument tells Pandas which row holds the column names, using None (no column names) or the row index, as `header=None` or `header=3` - Default = row with index 0. When writing, `header=False` leaves the column names out
    - Same for the row names: when reading, `index_col` gives the index of the column holding the row names, as `index_col=0` - Default = None, no row names column. When writing, `index=False` leaves the row names out - Default = True

In [None]:
!wget https://github.com/cbedart/CBPPS/raw/refs/heads/2024/IMDb_dataset.tsv > /dev/null 2>&1
!mkdir imdb ; mv IMDb_dataset.tsv imdb/IMDb_dataset.tsv

imdb = pd.read_csv("imdb/IMDb_dataset.tsv", sep="\t")
imdb

In [None]:
# header=False and index=False leave the column names and row names out of the file
imdb.to_csv("imdb.txt", sep="µ", header=False, index=False)
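
The effect of `sep`, `index_col`, `header`, and `index` can be tried without any file on disk. In this sketch, the content of `text` is made up, `io.StringIO` lets Pandas read a string as if it were a file, and `to_csv()` called without a path returns the text instead of writing it:

```python
import io

import pandas as pd

# Hypothetical tab-separated content: a header line and a row names column
text = "name\tscore\tvotes\nA\t8.5\t100\nB\t7.0\t250\n"

# The first column (index 0) is used as row names
table = pd.read_csv(io.StringIO(text), sep="\t", index_col=0)

# Without a path, to_csv returns the text instead of writing a file
as_csv = table.to_csv(sep=",")
no_labels = table.to_csv(sep=",", header=False, index=False)

print(table)
print(as_csv)
print(no_labels)
```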

- `pd.read_excel("file.xlsx")` & `dataframeX.to_excel("file.xlsx")` to read/save as excel files
- `pd.read_pickle("file.pkl")` & `dataframeX.to_pickle("file.pkl")` to read/save as a binary file named pickle. It keeps the exact types of your data and can optionally be compressed, but only open pickle files coming from a source you trust
- `pd.read_json("file.json")` & `dataframeX.to_json("file.json")` to read/write JSON files
- etc.

<br />

---   

# **➤ Exercises:**


**<u>Exercise 1 - Use case in data analysis... but sounds familiar:</u>** You are going to study some data from IMDb, a website that lists movies and series and lets users rate them.

- By launching the first block below, you will download the file `IMDb_dataset.tsv` under the path `/content/imdb/IMDb_dataset.tsv` (you can click on this link after running the block to see the file in a new window on the right).
- This is a `tab-separated value` text file, where all the fields in a line are separated by a tab `\t`.
- All movies are considered to be longer than 60 minutes, and all series shorter than 60 minutes.

<br />

From the data in this file:

1. Read the file, and transform it into a more suitable format, using what you've seen so far, so you can use it easily.
2. What is the average score for all the movies, and for all the series?
3. What are the most common recommended audience categories for all these productions?
4. How many productions were created in 1998? In 2002? In 2015?
5. Calculate a new variable dividing the number of votes by the rating awarded. Which production is the most successful, with the highest score in relation to the number of votes? And the least one?
6. Search your data for information on "Arcane", the best series ever created, by displaying its score and the number of voters using `print()` and an f-string. Generalize your code with a function to give the information for any row in the dataset if it exists, otherwise say that the name is not in the list.

In [None]:
##### RUN BEFORE YOUR EXERCISE 1 #####

!wget https://github.com/cbedart/CBPPS/raw/refs/heads/2024/IMDb_dataset.tsv > /dev/null 2>&1
!mkdir imdb ; mv IMDb_dataset.tsv imdb/IMDb_dataset.tsv

######################################

In [None]:
# Exercise 1 - #1
# Read the file, and transform it into a more suitable format, using what you've seen so far, so you can use it easily.





In [None]:
# Exercise 1 - #2
# What is the average score for all the movies, and for all the series?





In [None]:
# Exercise 1 - #3
# What are the most common recommended audience categories for all these productions?





In [None]:
# Exercise 1 - #4
# How many productions were created in 1998? In 2002? In 2015?





In [None]:
# Exercise 1 - #5
# Calculate a new variable dividing the number of votes by the rating awarded. Which production is the most successful, with
# the highest score in relation to the number of voters? And the least one?





In [None]:
# Exercise 1 - #6
# Search your data for information on "Arcane", the best series ever created, by displaying its score and the number of voters
# using print() and an f-string. Generalise your code to give the information for any row in the dataset if it exists, otherwise
# say that the name is not in the list.





**<u>Exercise 2 - Evolution of the COVID-19 pandemic in France:</u>**

You will study the summary of indicators tracking the COVID-19 epidemic in France, by French departments, from January 23, 2020 to June 30, 2023, using the `/content/covid.csv` file
- You will find all the information in the header of the data gouv website (in French), but you can easily use DeepL or Google Translate on the webpage/Data description part to get the most important information
- https://www.data.gouv.fr/fr/datasets/synthese-des-indicateurs-de-suivi-de-lepidemie-covid-19/

<br />

From the data in this file:
1. Load the file, and use clear and meaningful names for each column instead of the abbreviations.
2. How many French departments and regions are there in this document?
3. In which department and on which day was there the greatest peak in current hospitalizations and new hospitalizations?
4. Filter your DataFrame to only keep the numbers for the "Nord" department.
    - When did we reach our highest positivity, incidence, and occupancy rates?
    - How many total hospitalizations and deaths were caused by COVID in the department during the epidemic?
    - What was the average number of new hospitalizations per year in 2020, 2021, 2022, and 2023?
5. Group all departments together to obtain data for the whole country.
    - When did we reach our highest virus replication R, positivity, incidence, and occupancy T0 rates?
    - How many total hospitalizations and deaths were caused by COVID in the country during the epidemic?
    - What was the average number of new hospitalizations per year in 2020, 2021, 2022, and 2023?
6. Split the date using the `dataframeX["date"].str.split("-", expand=True)` function, to get a new dataframe with the year, the month, and the day in 3 columns.
    - Add to your big dataframes (all departments and France) the year, month and day in 3 different columns with clear and meaningful names.
    - Using a loop, iterate over each month of your files to create two new monthly dataframes (department and France), to identify the number of new hospitalizations and deaths per month, as well as the average virus replication R, positivity, incidence, and occupancy T0 rates over the time periods.
7. Using a function taking as input the name of a department and a year, automate an output (with `return dataframeX`) to give monthly statistics for the department in question compared with the whole of France as the simplest DataFrame (PS: It's a must to use existing parts of your code, to generalize what you have done.).

In [None]:
##### RUN BEFORE YOUR EXERCISE 2 #####

!wget https://www.data.gouv.fr/fr/datasets/r/5c4e1452-3850-4b59-b11c-3dd51d7fb8b5 > /dev/null 2>&1
!mv 5c4e1452-3850-4b59-b11c-3dd51d7fb8b5 covid.csv

######################################

In [None]:
# Exercise 2 - #1
# Load the file, and use clear and meaningful names for each column instead of the abbreviations.




In [None]:
# Exercise 2 - #2
# How many French departments and regions are there in this document?




In [None]:
# Exercise 2 - #3
# In which department and on which day was there the greatest peak in current hospitalizations and new hospitalizations?




In [None]:
# Exercise 2 - #4
# Filter your DataFrame to only keep the numbers for the "Nord" department.
#   - When did we reach our highest positivity, incidence, and occupancy rates?
#   - How many total hospitalizations and deaths were caused by COVID in the department during the epidemic?
#   - What was the average number of new hospitalizations per year in 2020, 2021, 2022, and 2023?




In [None]:
# Exercise 2 - #5
# Group all departments together to obtain data for the whole country.
#   - When did we reach our highest virus replication R, positivity, incidence, and occupancy T0 rates?
#   - How many total hospitalizations and deaths were caused by COVID in the country during the epidemic?
#   - What was the average number of new hospitalizations per year in 2020, 2021, 2022, and 2023?




In [None]:
# Exercise 2 - #6
# Split the date using the dataframeX["date"].str.split("-", expand=True) function, to get a new dataframe with the year, the month, and the day in 3 columns.
#   - Add to your big dataframes (all departments and France) the year, month and day in 3 different columns with clear and meaningful names.
#   - Using a loop, iterate over each month of your files to create two new monthly dataframes (department and France), to identify the number of new hospitalizations
#       and deaths per month, as well as the average virus replication R, positivity, incidence, and occupancy T0 rates over the time periods




In [None]:
# Exercise 2 - #7
# Using a function taking as input the name of a department and a year, automate an output (with `return dataframeX`)
# to give monthly statistics for the department in question compared with the whole of France as the simplest DataFrame.
# (PS: It's a must to use existing parts of your code, to generalize what you have done.).




**<u>Exercise 3 - As you want:</u>**

If you have time at the end, you can try to analyze some of your research files using Pandas and/or NumPy.

In [None]:
# Exercise 3


