<a href="https://colab.research.google.com/github/cbedart/CBPPS/blob/2024/CBPPS_part7_matplotlib_specialized_libraries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**<h1><center>Part 7 - Matplotlib and specialized libraries </center></h1>**

---

# **➤ Matplotlib**

- Low-level plotting library and visualization toolkit that offers:
  - A wide variety of plots (line, scatter, bar, histogram, ...)
  - Highly customizable visualizations
  - Seamless integration with other scientific libraries such as NumPy and Pandas


- You can install it with one of the commands:

```
pip install matplotlib
OR
conda install -c conda-forge matplotlib
```
- To load the whole Matplotlib module, add `import matplotlib as mpl` to your code
- In practice, however, we almost always use only the pyplot part of the library, so this is the only part we import: `import matplotlib.pyplot as plt`

<br />

- /!\ It is strongly recommended to use the official cheatsheets and examples:
  - https://matplotlib.org/cheatsheets/
  - https://matplotlib.org/stable/gallery/index.html

### **Basic plotting**

- The `plt.plot(X, Y)` function creates a simple line plot
- `plt.show()` displays the plot: it is required when the code runs as a standalone script
- Main advantage of Jupyter Notebook and Google Colab: the plot is displayed directly in the cell output

In [None]:
import matplotlib.pyplot as plt

x1 = [1,2,3,4,5]
y1 = [1,4,16,12,25]

plt.plot(x1,y1)
plt.show()

You can use (non-exhaustive list):
- The `color` or `c` argument to change the color (with words, letters, or hexadecimal colors)
- The `marker` argument to change the marker of each point. Some examples:
  - `marker="."` => Point
  - `marker="o"` => Circle
  - `marker="*"` => Star
  - `marker="x"` => X
  - `marker="X"` => X filled
  - `marker="s"` => Square
  - `marker="D"` => Diamond (or `marker="d"` for a thin diamond)
- The `markersize` or `ms` argument to change the size of the markers
- The `linewidth` or `lw` argument to change the width of the line
- The `linestyle` or `ls` argument to change the style of the line. Some examples:
  - `linestyle="-"` => Solid
  - `linestyle=":"` => Dotted
  - `linestyle="--"` => Dashed
  - `linestyle="-."` => Dashed & dotted


In [None]:
plt.plot(x1,y1, marker="o", linestyle="-")
plt.show()

In [None]:
plt.plot(x1,y1, marker="X", linestyle="--", color="#FF0000", ms = 20)
plt.show()

In [None]:
plt.plot(x1,y1, marker="d", linestyle="-.", color="forestgreen", lw=2)
plt.plot(y1,x1, marker="X", linestyle="--", color="dodgerblue", lw = 1)
plt.plot(y1,y1, marker="s", linestyle=":", color="firebrick", lw=0.5)
plt.show()

### **Figure and axes**

- Figure = The entire visualization space, with one or multiple plots
- Axes = The plot within a figure: data, axis labels, ticks, gridlines, title, etc.
- **All the plots will be created using the axes element instead of pyplot base functions**
- Use `fig, ax = plt.subplots()` to create one figure element and one axes element (the default)
- Use `fig, axs = plt.subplots(nrows, ncols, figsize=(width, height))` to create:
  - A grid of nrows x ncols plots (number of rows first, then number of columns)
  - A figure of size (width, height) in inches






In [None]:
fig, ax = plt.subplots()
ax.plot(x1,y1)
plt.show()

In [None]:
fig, axs = plt.subplots(2, 2, figsize=(10, 8))

axs[0,0].plot(x1,y1)
axs[1,0].plot(y1,y1)
axs[0,1].plot(y1,x1)
axs[1,1].plot(x1,x1)
plt.show()
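
The grid shape can be checked directly on the array returned by `plt.subplots`. The sketch below assumes a 2 x 3 grid and illustrative data (`x_demo` and `y_demo` are hypothetical names, not part of the course data). The first argument is the number of rows, the second the number of columns, and `axs.flat` iterates over every axes element of the grid.

```python
import matplotlib.pyplot as plt

# Illustrative data (hypothetical values)
x_demo = [1, 2, 3, 4, 5]
y_demo = [1, 4, 9, 16, 25]

# 2 rows and 3 columns of plots, in a 12 x 6 inch figure
fig, axs = plt.subplots(2, 3, figsize=(12, 6))
print(axs.shape)  # (number of rows, number of columns)

# axs.flat walks through the grid row by row
for i, ax in enumerate(axs.flat):
    ax.plot(x_demo, y_demo)
    ax.set_title(f"Plot {i + 1}")

fig.tight_layout()  # reduce overlap between titles and tick labels
plt.show()
```

Looping over `axs.flat` avoids writing one line per `axs[row, column]` when every subplot is built the same way.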

### **Customization**

Using the plt method:
- Title: `plt.title("My Plot")`
- Labels: `plt.xlabel("X-axis")` & `plt.ylabel("Y-axis")`
- Grid: `plt.grid(True)`
- Legend: `plt.legend(["Data"])`
- Axis range: `plt.xlim(min,max)` & `plt.ylim(min,max)`
- Log scale: `plt.xscale("log")` & `plt.yscale("log")`

In [None]:
plt.plot(x1,y1, marker="o")
plt.plot(y1,x1, marker="d")
plt.title("Title of the plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")

plt.grid(True)
plt.legend(["Data1", "Data2"])

plt.xlim(1,30)
plt.ylim(1,50)

plt.xscale("log")
plt.yscale("linear")

plt.show()

Using the subplot and fig/ax method => Almost the same, but using `set_XXXX` in almost all cases:
- Title: `ax.set_title("My Plot")`
- Labels: `ax.set_xlabel("X-axis")` & `ax.set_ylabel("Y-axis")`
- Grid: `ax.grid(True)`
- Legend: `ax.legend(["Data"])`
- Axis range: `ax.set_xlim(min,max)` & `ax.set_ylim(min,max)`
- Log scale: `ax.set_xscale("log")` & `ax.set_yscale("log")`


In [None]:
fig, ax = plt.subplots()

ax.plot(x1,y1, marker="o")
ax.plot(y1,x1, marker="d")

ax.set_title("Title of the plot")
ax.set_xlabel("X-axis")
ax.set_ylabel("Y-axis")

ax.grid(True)
ax.legend(["Data1", "Data2"])

ax.set_xlim(1,30)
ax.set_ylim(1,50)

ax.set_xscale("log")
ax.set_yscale("linear")

plt.show()

In [None]:
fig, axs = plt.subplots(1,2, figsize=(12, 5))

fig.suptitle("Main title")

axs[0].plot(x1,y1, marker="o")
axs[1].plot(y1,x1, marker="d")

axs[0].set_title("Title of the plot - 1")
axs[0].set_xlabel("X-axis")
axs[0].set_ylabel("Y-axis")

axs[1].set_title("Title of the plot - 2")
axs[1].set_xlabel("X-axis")
axs[1].set_ylabel("Y-axis")

axs[1].grid(True)
axs[0].legend(["Data1"])  # pass a list: a bare string is read one character per entry

axs[0].set_xlim(1,10)
axs[1].set_ylim(1,6)

axs[1].set_xscale("log")

plt.show()
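
An alternative way to build the legend is to attach a `label` to each call to `ax.plot` and then call `ax.legend()` without arguments. This is a minimal sketch with illustrative data (`x_demo` and `y_demo` are hypothetical names); it is one possible pattern, not the only correct one.

```python
import matplotlib.pyplot as plt

# Illustrative data (hypothetical values)
x_demo = [1, 2, 3, 4, 5]
y_demo = [1, 4, 9, 16, 25]

fig, ax = plt.subplots()

# Each line carries its own label...
ax.plot(x_demo, y_demo, marker="o", label="Data1")
ax.plot(y_demo, x_demo, marker="d", label="Data2")

# ...so legend() needs no argument and labels cannot get out of order
ax.legend()

# Read the legend entries back to check them
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)

plt.show()
```

Keeping the label next to the data it describes makes it harder to mislabel a curve when lines are added or reordered.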

### **Types of plots**

Again, it is strongly recommended to use the official cheatsheets and examples:
  - https://matplotlib.org/cheatsheets/
  - https://matplotlib.org/stable/gallery/index.html

Here are some examples:
- Line Plot => `plt.plot`
- Scatter Plot => `plt.scatter`
- Bar Plot => `plt.bar`
- Histogram => `plt.hist`
- Pie Chart => `plt.pie`

In [None]:
plt.scatter(x1,y1, marker="o")
plt.show()

In [None]:
plt.bar(x1,y1)
plt.show()

In [None]:
plt.hist(y1, bins = 2)
plt.show()

In [None]:
plt.pie(y1)
plt.show()
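
The same plot types are available as methods of an axes element, which lets you combine several of them in one figure. The sketch below uses illustrative data (`x_demo` and `y_demo` are hypothetical names) and also shows `boxplot`, another common plot type that follows the same pattern.

```python
import matplotlib.pyplot as plt

# Illustrative data (hypothetical values)
x_demo = [1, 2, 3, 4, 5]
y_demo = [1, 4, 16, 12, 25]

fig, axs = plt.subplots(2, 2, figsize=(10, 8))

axs[0, 0].scatter(x_demo, y_demo, marker="o")
axs[0, 0].set_title("Scatter")

axs[0, 1].bar(x_demo, y_demo)
axs[0, 1].set_title("Bar")

axs[1, 0].hist(y_demo, bins=2)
axs[1, 0].set_title("Histogram")

# One box per list; boxes are drawn at x positions 1, 2, ...
boxes = axs[1, 1].boxplot([x_demo, y_demo])
axs[1, 1].set_xticks([1, 2])
axs[1, 1].set_xticklabels(["x_demo", "y_demo"])
axs[1, 1].set_title("Boxplot")

fig.tight_layout()
plt.show()
```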

### **From Pandas to Matplotlib**

From a Pandas DataFrame, you can easily generate a plot from one or several columns with `df.plot()`. It draws with Matplotlib and returns an axes (`ax`) element that you can customize further.

```
df = pd.DataFrame(data)
df.plot()
```



In [None]:
import pandas as pd

df = pd.DataFrame({"X":x1,"Y":y1})

df.plot(x="X", y="Y", kind="scatter", title="Scatter Plot")

In [None]:
df.plot(x="X", y="Y", kind="bar", title="Bar Plot")

In [None]:
ax = df.plot(x="X", y="Y", title="Custom Plot")
ax.set_xlim(0, 6)
ax.set_ylim(0, 40)
ax.grid(True)
plt.show()
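
`df.plot` also accepts an `ax` argument, so a DataFrame can be drawn on an axes element that you created yourself with `plt.subplots`. This is a minimal sketch, assuming a small illustrative DataFrame (`df_demo` is a hypothetical name).

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative data (hypothetical values)
df_demo = pd.DataFrame({"X": [1, 2, 3, 4, 5], "Y": [1, 4, 16, 12, 25]})

fig, axs = plt.subplots(1, 2, figsize=(12, 5))

# ax= tells Pandas which axes element to draw on
df_demo.plot(x="X", y="Y", kind="scatter", ax=axs[0])
df_demo.plot(x="X", y="Y", kind="bar", ax=axs[1])

# The axes are regular Matplotlib objects, so the usual methods apply
axs[0].set_title("Scatter Plot")
axs[1].set_title("Bar Plot")
axs[1].grid(True)

plt.show()
```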

<br />

---

# **➤ SciPy**

- Open-source Python library built on NumPy
- Additional functionalities for scientific and technical computing
- Comprehensive suite of tools for numerical analysis, optimization, signal processing, etc.
- SciPy integrates seamlessly with NumPy arrays and provides a higher-level interface for many mathematical operations
- Install using `pip install scipy`, and load using `import scipy`
- Official documentation = https://docs.scipy.org/doc/scipy/tutorial/index.html
- Cheat sheets:
  - https://media.datacamp.com/legacy/image/upload/v1676303474/Marketing/Blog/SciPy_Cheat_Sheet.pdf
  - https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_SciPy_Cheat_Sheet_Linear_Algebra.pdf
  

### **Core submodules**

SciPy has several submodules adapted to specific contexts:

- `scipy.constants`: Access to physical and mathematical constants.
- `scipy.linalg`: Advanced linear algebra operations (= solving linear systems, matrix decompositions, etc.).
- `scipy.optimize`: Optimization algorithms (= root-finding, curve fitting, etc.).
- `scipy.integrate`: Numerical integration (= quad, dblquad, simpson, etc.).
- `scipy.stats`: Statistical analysis and probability distributions.
- `scipy.spatial`: Spatial data structures and algorithms (= KD-trees, distance calculations, etc.).
- `scipy.interpolate`: Interpolation methods for data fitting.
- `scipy.fft`: Fast Fourier Transform for signal processing.
- `scipy.ndimage`: Image processing capabilities.
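
To give an idea of how these submodules are used, here is a short sketch touching three of them. The integrand, the `line` model and the data points are illustrative assumptions, not course data.

```python
import numpy as np
from scipy import constants, integrate, optimize

# scipy.constants: physical constants in SI units
print(constants.Avogadro)  # Avogadro constant, in mol^-1

# scipy.integrate: numerical integral of x^2 between 0 and 1 (exact value: 1/3)
area, abs_error = integrate.quad(lambda x: x**2, 0, 1)
print(area)

# scipy.optimize: least-squares fit of a straight line to illustrative data
def line(x, slope, intercept):
    return slope * x + intercept

x_data = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_data = np.array([1.1, 2.9, 5.2, 6.8, 9.0])
(slope, intercept), covariance = optimize.curve_fit(line, x_data, y_data)
print(slope, intercept)
```

`quad` returns both the integral and an estimate of the absolute error, and `curve_fit` returns the fitted parameters together with their covariance matrix.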

### **One of the most important submodules in Biology and Health Sciences = `scipy.stats`:**

You can do:

**=> Descriptive statistics to summarize data:**
  - Mean: `scipy.stats.tmean(data)` (trimmed mean: with no limits given, it is the ordinary mean)
  - Median: `scipy.stats.scoreatpercentile(data, 50)`
  - Variance and standard deviation: `scipy.stats.tvar(data)`, `scipy.stats.tstd(data)` (sample estimates, i.e. divided by n - 1)
  - Percentiles: `scipy.stats.scoreatpercentile(data, p)`

In [None]:
import scipy.stats

x2 = [14, 19, 18, 17, 18, 16, 21, 21, 8, 11, 10, 13, 10, 17, 8, 10, 14, 17, 8, 13, 13, 15, 17, 11, 12, 14, 8, 10, 11, 9, 21, 12, 17, 16, 19, 21, 21, 24, 10, 11, 14, 8, 20, 18, 12, 25, 10, 4, 8, 12]
y2 = [36, 30, 36, 44, 36, 35, 31, 39, 40, 34, 35, 33, 34, 35, 32, 21, 43, 34, 36, 31, 31, 28, 31, 34, 37, 29, 38, 35, 34, 38, 31, 35, 39, 36, 38, 29, 28, 46, 34, 38, 33, 30, 43, 32, 35, 44, 26, 36, 29, 34]

In [None]:
print(scipy.stats.tmean(x2))
print(scipy.stats.scoreatpercentile(x2, 50))
print(scipy.stats.tvar(x2))
print(scipy.stats.tstd(x2))
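
Several of these values can also be obtained in one call with `scipy.stats.describe`, complemented by `numpy.median` and `scipy.stats.iqr`. This is a sketch on a small illustrative list (`values` is a hypothetical name), shown as an alternative to the functions above.

```python
import numpy as np
import scipy.stats

values = [4, 8, 15, 16, 23, 42]  # illustrative data (hypothetical values)

# describe() returns several summary statistics at once
summary = scipy.stats.describe(values)
print(summary.nobs)      # number of observations
print(summary.minmax)    # (minimum, maximum)
print(summary.mean)
print(summary.variance)  # sample variance (divided by n - 1)

median_value = np.median(values)
iqr_value = scipy.stats.iqr(values)  # 75th percentile - 25th percentile
print(median_value, iqr_value)
```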

**=> Hypothesis testing to evaluate statistical hypotheses**


| **Test Name**              | **Purpose**                                                                 | **Function**                             | **Inputs**                                              | **Example Use Case**                                                                                                                                   |
|----------------------------|-----------------------------------------------------------------------------|------------------------------------------|---------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|
| **One-Sample T-Test**      | Test if the mean of a sample differs from a known value                    | `ttest_1samp(data, popmean)`             | `data`: Sample data, `popmean`: Hypothesized mean       | Evaluate if the average blood pressure of a patient group differs from 120 mmHg.                                                                       |
| **Independent T-Test**     | Test if two independent samples have the same mean                        | `ttest_ind(data1, data2)`                | `data1`, `data2`: Two independent samples               | Compare the average heart rates of two groups on different diets.                                                                                      |
| **Paired T-Test**          | Test if the mean difference between paired observations is zero           | `ttest_rel(data1, data2)`                | `data1`, `data2`: Paired sample data                    | Assess the effect of a drug by comparing pre- and post-treatment blood sugar levels in the same patients.                                              |
| **One-Way ANOVA**          | Test if means of multiple groups are equal                                | `f_oneway(data1, data2, ...)`            | `data1`, `data2`, ...: Independent groups              | Determine if average recovery times differ across three different hospitals.                                                                            |
| **Chi-Square Test**        | Test goodness-of-fit between observed and expected categorical data       | `chisquare(f_obs, f_exp)`                | `f_obs`: Observed frequencies, `f_exp`: Expected counts | Test if the observed distribution of side effects matches the expected distribution.                                                                   |
| **Kolmogorov-Smirnov Test**| Test if a sample matches a given distribution                             | `kstest(data, cdf)`                      | `data`: Sample data, `cdf`: Theoretical distribution    | Check if patient weight data follows a normal distribution.                                                                                            |
| **Mann-Whitney U Test**    | Compare two independent samples when data isn’t normally distributed      | `mannwhitneyu(data1, data2)`             | `data1`, `data2`: Two independent samples               | Compare recovery scores of patients undergoing two different therapies with non-normal data.                                                            |
| **Wilcoxon Signed-Rank Test**| Compare two paired samples when data isn’t normally distributed          | `wilcoxon(data1, data2)`                 | `data1`, `data2`: Paired sample data                    | Assess changes in cholesterol levels before and after treatment within the same group of patients.                                                     |
| **Kruskal-Wallis Test**    | Non-parametric test for equality of medians across multiple groups        | `kruskal(data1, data2, ...)`             | `data1`, `data2`, ...: Independent groups              | Test if pain levels differ across three treatment groups when the data is ordinal or not normally distributed.                                          |
| **Fisher’s Exact Test**    | Test the independence of two categorical variables in a 2x2 contingency table | `fisher_exact(table)`                  | `table`: 2x2 contingency table                          | Assess the association between gender (male/female) and response to a drug (effective/ineffective).                                                    |
| **Binomial Test**          | Test if the proportion of success in a sample matches a known proportion  | `binomtest(k, n, p)`                     | `k`: Success count, `n`: Sample size, `p`: Hypothesized proportion | Check if the proportion of patients responding to a treatment is significantly different from 50%.                                                     |
| **Levene’s Test**          | Test the equality of variances in multiple groups                         | `levene(data1, data2, ...)`              | `data1`, `data2`, ...: Samples                          | Test if blood pressure variability is the same across different age groups.                                                                             |
| **Shapiro-Wilk Test**      | Test if data is normally distributed                                      | `shapiro(data)`                          | `data`: Sample data                                      | Verify if the weights of patients in a study are normally distributed.                                                                                  |
| **Bartlett’s Test**        | Test the equality of variances across groups (assumes normality)          | `bartlett(data1, data2, ...)`            | `data1`, `data2`, ...: Samples                          | Determine if drug efficacy measurements have similar variability across treatment groups.                                                               |
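
As an illustration of how the functions in the table are called, the sketch below runs a paired t-test, a chi-square goodness-of-fit test and Fisher's exact test. All numbers are hypothetical values made up for the example, not real measurements.

```python
import scipy.stats

# Paired t-test: same patients measured before and after (hypothetical values)
before = [5.8, 6.1, 7.0, 6.4, 5.9, 6.8]
after = [5.5, 5.9, 6.4, 6.1, 5.8, 6.2]
t_stat, p_paired = scipy.stats.ttest_rel(before, after)

# Chi-square goodness of fit: observed vs expected counts (same total)
chi2_stat, p_chi2 = scipy.stats.chisquare([18, 22, 20], f_exp=[20, 20, 20])

# Fisher's exact test on a 2x2 contingency table (hypothetical counts)
# columns: effective, ineffective
table = [[8, 2],   # group A
         [1, 5]]   # group B
odds_ratio, p_fisher = scipy.stats.fisher_exact(table)

print(p_paired, p_chi2, p_fisher)
```

Each function returns a test statistic and a p-value, which can be unpacked into two variables as done here.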

In [None]:
scipy.stats.ttest_1samp(x2, 14)

In [None]:
scipy.stats.ttest_ind(x2, y2)

In [None]:
print(scipy.stats.shapiro(x2))
print(scipy.stats.shapiro(y2))

In [None]:
scipy.stats.mannwhitneyu(x2, y2)
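
The results above can be combined into a small decision helper. The function `compare_two_groups` below is a hypothetical helper written for this example, and the 0.05 threshold is an assumption. Choosing the test from a normality check alone is a simplification that may not suit every study design.

```python
import scipy.stats

def compare_two_groups(group_a, group_b, alpha=0.05):
    """Hypothetical helper: pick a two-sample test from a normality check."""
    # Shapiro-Wilk: a p-value above alpha means normality is not rejected
    normal_a = scipy.stats.shapiro(group_a).pvalue > alpha
    normal_b = scipy.stats.shapiro(group_b).pvalue > alpha

    if normal_a and normal_b:
        test_name = "Independent t-test"
        result = scipy.stats.ttest_ind(group_a, group_b)
    else:
        test_name = "Mann-Whitney U test"
        result = scipy.stats.mannwhitneyu(group_a, group_b)

    return test_name, result.statistic, result.pvalue

# Illustrative data (hypothetical values)
group_a = [14, 19, 18, 17, 18, 16, 21, 21, 8, 11]
group_b = [36, 30, 36, 44, 36, 35, 31, 39, 40, 34]

test_name, statistic, p_value = compare_two_groups(group_a, group_b)
print(test_name, statistic, p_value)
print("Significant difference" if p_value < 0.05 else "No significant difference")
```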

<br />

---

# **➤ RDKit:**

- Open-source toolkit for cheminformatics
- https://www.rdkit.org/docs/GettingStartedInPython.html
- Main features:
  - Molecular representation = Handles molecules using SMILES, InChI, or molecular files
  - 2D and 3D visualization = Generates high-quality molecular drawings and visualizations
  - Chemical property calculation = Computes molecular properties like molecular weight, logP, and topological polar surface area
  - Molecular Fingerprints = Generates various molecular fingerprints for similarity searches and machine learning
  - Reaction Modeling = Supports reaction mapping and retrosynthesis studies
  - Interconnected with Pandas
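
A minimal sketch of the first features listed above, assuming RDKit is installed (for example with `pip install rdkit`). It reads aspirin from a SMILES string and computes a few descriptors; the choice of molecule is only an illustration.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# Aspirin, written as a SMILES string
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

n_heavy_atoms = mol.GetNumAtoms()    # hydrogens are implicit by default
mol_weight = Descriptors.MolWt(mol)  # average molecular weight (g/mol)
logp = Descriptors.MolLogP(mol)      # Crippen logP estimate
tpsa = Descriptors.TPSA(mol)         # topological polar surface area

print(n_heavy_atoms, round(mol_weight, 2), round(logp, 2), round(tpsa, 2))
```

`Chem.MolFromSmiles` returns `None` when the SMILES string cannot be parsed, so it is worth checking the result before computing descriptors on real data.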

<br />

---

# **➤ BioPython:**

- Open-source toolkit for bioinformatics
- https://biopython.org/docs/dev/Tutorial/index.html
- Main features:
  - Sequence handling = Reads, writes, and manipulates biological sequences in FASTA, GenBank, and other formats
  - Structural bioinformatics = Handles 3D molecular structures for protein and nucleic acid analysis
  - Data analysis = Includes algorithms for alignment, clustering, and phylogenetics
  - Interfacing with databases = Fetches biological data from public databases like NCBI and UniProt
  - Graphics support = Creates publication-quality plots of sequences, structures, and other data
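
A minimal sketch of sequence handling, assuming BioPython is installed (for example with `pip install biopython`). The DNA sequence is an arbitrary illustrative example.

```python
from Bio.Seq import Seq

# Illustrative coding DNA sequence (13 codons)
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

print(coding_dna.complement())
print(coding_dna.reverse_complement())

messenger_rna = coding_dna.transcribe()  # DNA -> mRNA (T replaced by U)
print(messenger_rna)

protein = coding_dna.translate()  # standard genetic code, "*" marks a stop codon
print(protein)
```

A `Seq` object behaves much like a Python string (slicing, `len`, concatenation), with biology-specific methods added on top.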

<br />

---

# **➤ Exercises:**


**<u>Evolution of the COVID-19 pandemic in France - Part 2:</u>**

You will continue to study the summary of indicators tracking the COVID-19 epidemic in France, by French departments, from January 23, 2020 to June 30, 2023, using the `/content/covid.csv` file
- You will find all the information in the header of the data gouv website (in French), but you can easily use DeepL or Google Translate on the webpage/Data description part to get the most important information
- https://www.data.gouv.fr/fr/datasets/synthese-des-indicateurs-de-suivi-de-lepidemie-covid-19/

<br />

1. Plot the trend of COVID-19 hospitalizations **in the Nord** using a scatter plot.
2. Plot the trend of COVID-19 hospitalizations **in France** using a line graph.
3. Plot a boxplot of new hospitalizations in "summer" 2021 per region (summer = June, July, August).
4. Is the difference between new hospitalizations in "summer" 2021 in "Hauts-de-France" and "Occitanie" statistically significant? And between "Hauts-de-France" and "Provence-Alpes-Côte d'Azur"?
5. On the same plot, show the new hospitalizations in "summer" 2021 for those 3 regions.


In [None]:
##### RUN BEFORE YOUR EXERCISE #####

!wget https://www.data.gouv.fr/fr/datasets/r/5c4e1452-3850-4b59-b11c-3dd51d7fb8b5 > /dev/null 2>&1
!mv 5c4e1452-3850-4b59-b11c-3dd51d7fb8b5 covid.csv

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats

######################################

In [None]:
# Exercise - #1
# Plot the trend of COVID-19 hospitalizations in the Nord using a scatter plot.



In [None]:
# Exercise - #2
# Plot the trend of COVID-19 hospitalizations in France using a line graph.



In [None]:
# Exercise - #3
# Plot a boxplot of new hospitalizations in "summer" 2021 per region (summer = June, July, August).



In [None]:
# Exercise - #4
# Is the difference between new hospitalizations in "summer" 2021 in "Hauts-de-France" and "Occitanie" statistically significant?
# And between "Hauts-de-France" and "Provence-Alpes-Côte d'Azur"?



In [None]:
# Exercise - #5
# On the same plot, show the new hospitalizations in "summer" 2021 for those 3 regions.



<br />

---

# **➤ Create a Python project as part of your thesis:**

In this part of the course, identify what you could automate, analyze, or code to speed up and/or simplify your thesis work.

<br />

**Main objectives:**
- Rely on generative AI as little as possible to write your code. You can, however, use it to better understand how certain libraries work, or to have code explained to you
- Write mostly functions, so you can reuse your code much more easily later
- Apply everything you have learned so far in Python to your research work
- Be curious: look for relevant Python libraries you can use, and read their documentation
- Do not use sensitive data here: I will be helping you and we may discuss your project as a group, so avoid any confidentiality issue

<br />

**A few ideas, if you don't have any:**
- Automate your data visualization by generating reports and graphs

- Create functions to avoid repeated copy-pasting, which can lead to careless errors, and/or to improve the reproducibility of your work
- Convert data from your instrument, analysis software, etc. to a more usable format, in Python or using Excel
- Make your analyses accessible to non-experts by simplifying the steps as much as possible
- If you have a niche domain, simplify some operations to make them available to your peers in the future