# Using Jupyter Notebooks
:label:`sec_jupyter`


This section describes how to edit and run the code
in each section of this book
using the Jupyter Notebook. Make sure you have
installed Jupyter and downloaded the
code as described in
:ref:`chap_installation`.
If you want to know more about Jupyter, see the excellent tutorial in
its [documentation](https://jupyter.readthedocs.io/en/latest/).


## Editing and Running the Code Locally

Suppose that the local path of the book's code is `xx/yy/d2l-en/`. Use the shell to change the directory to this path (`cd xx/yy/d2l-en`) and run the command `jupyter notebook`. If your browser does not open automatically, go to http://localhost:8888 and you will see the Jupyter interface and all the folders containing the code of the book, as shown in :numref:`fig_jupyter00`.

![The folders containing the code of this book.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter00.png?raw=1)
:width:`600px`
:label:`fig_jupyter00`


You can access the notebook files by clicking on the folder displayed on the webpage.
They usually have the suffix ".ipynb".
For the sake of brevity, we create a temporary "test.ipynb" file.
The content displayed after you click it is
shown in :numref:`fig_jupyter01`.
This notebook includes a markdown cell and a code cell. The content in the markdown cell includes "This Is a Title" and "This is text.".
The code cell contains two lines of Python code.

![Markdown and code cells in the "test.ipynb" file.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter01.png?raw=1)
:width:`600px`
:label:`fig_jupyter01`


Double click on the markdown cell to enter edit mode.
Add a new text string "Hello world." at the end of the cell, as shown in :numref:`fig_jupyter02`.

![Edit the markdown cell.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter02.png?raw=1)
:width:`600px`
:label:`fig_jupyter02`


As demonstrated in :numref:`fig_jupyter03`,
click "Cell" $\rightarrow$ "Run Cells" in the menu bar to run the edited cell.

![Run the cell.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter03.png?raw=1)
:width:`600px`
:label:`fig_jupyter03`

After running, the markdown cell is shown in :numref:`fig_jupyter04`.

![The markdown cell after running.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter04.png?raw=1)
:width:`600px`
:label:`fig_jupyter04`


Next, click on the code cell. After the last line of code, add a line that multiplies the elements by 2, as shown in :numref:`fig_jupyter05`.

![Edit the code cell.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter05.png?raw=1)
:width:`600px`
:label:`fig_jupyter05`


You can also run the cell with a shortcut ("Ctrl + Enter" by default) and obtain the output shown in :numref:`fig_jupyter06`.

![Run the code cell to obtain the output.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter06.png?raw=1)
:width:`600px`
:label:`fig_jupyter06`


When a notebook contains more cells, we can click "Kernel" $\rightarrow$ "Restart & Run All" in the menu bar to run all the cells in the entire notebook. By clicking "Help" $\rightarrow$ "Edit Keyboard Shortcuts" in the menu bar, you can edit the shortcuts according to your preferences.

## Advanced Options

Beyond local editing, two things are quite important: editing the notebooks in the markdown format and running Jupyter remotely.
The latter matters when we want to run the code on a faster server.
The former matters since Jupyter's native ipynb format stores a lot of auxiliary data that is
irrelevant to the content,
mostly related to how and where the code is run.
This is confusing for Git, making
reviewing contributions very difficult.
Fortunately there is an alternative---native editing in the markdown format.

### Markdown Files in Jupyter

If you wish to contribute to the content of this book, you need to modify the
source file (md file, not ipynb file) on GitHub.
Using the notedown plugin we
can modify notebooks in the md format directly in Jupyter.


First, install the notedown plugin, run the Jupyter Notebook, and load the plugin:

```
pip install d2l-notedown  # You may need to uninstall the original notedown.
jupyter notebook --NotebookApp.contents_manager_class='notedown.NotedownContentsManager'
```

You may also turn on the notedown plugin by default whenever you run the Jupyter Notebook.
First, generate a Jupyter Notebook configuration file (if it has already been generated, you can skip this step).

```
jupyter notebook --generate-config
```

Then, add the following line to the end of the Jupyter Notebook configuration file (for Linux or macOS, usually in the path `~/.jupyter/jupyter_notebook_config.py`):

```
c.NotebookApp.contents_manager_class = 'notedown.NotedownContentsManager'
```

After that, you only need to run the `jupyter notebook` command to turn on the notedown plugin by default.

### Running Jupyter Notebooks on a Remote Server

Sometimes, you may want to run Jupyter notebooks on a remote server and access them through a browser on your local computer. If Linux or macOS is installed on your local machine (Windows can also support this function through third-party software such as PuTTY), you can use port forwarding:

```
ssh myserver -L 8888:localhost:8888
```

The above string `myserver` is the address of the remote server.
Then we can use http://localhost:8888 to access the remote server `myserver` that runs Jupyter notebooks. We will detail how to run Jupyter notebooks on AWS instances
later in this appendix.
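
If the page does not load, it can help to check whether anything is listening on the forwarded local port at all. The following is a minimal sketch, not part of Jupyter or SSH: the helper name `port_is_open` is hypothetical, and the default host and port are assumed to match the forwarding command above.

```python
import socket


def port_is_open(host="localhost", port=8888, timeout=1.0):
    """Return True if a TCP connection to (host, port) can be opened.

    Hypothetical helper for illustration; it only tests reachability
    and does not verify that the listener is actually Jupyter.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable.
        return False


# True only while the SSH tunnel (or a local Jupyter) is listening.
print(port_is_open())
```

A result of `False` usually suggests that the `ssh` session with port forwarding is no longer running, or that a different local port was chosen.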

### Timing

We can use the `ExecuteTime` plugin to time the execution of each code cell in Jupyter notebooks.
Use the following commands to install the plugin:

```
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
jupyter nbextension enable execute_time/ExecuteTime
```
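
When the plugin is not available, a single piece of code can also be timed by hand. The sketch below is one possible approach using only the standard library; the helper name `time_call` and the choice of reporting the best of several repeats are assumptions made for illustration, not something the `ExecuteTime` plugin does.

```python
import time


def time_call(fn, *args, repeats=3):
    """Return the best wall-clock time (seconds) over `repeats` calls of fn(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


# Example: time summing one million integers.
elapsed = time_call(sum, range(1_000_000))
print(f"best of 3 runs: {elapsed:.4f} s")
```

Inside a notebook, the built-in `%timeit` and `%%time` magics serve a similar purpose without any helper code.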

## Summary

* Using the Jupyter Notebook tool, we can edit, run, and contribute to each section of the book.
* We can run Jupyter notebooks on remote servers using port forwarding.


## Exercises

1. Edit and run the code in this book with the Jupyter Notebook on your local machine.
1. Edit and run the code in this book with the Jupyter Notebook *remotely* via port forwarding.
1. Compare the running time of the operations $\mathbf{A}^\top \mathbf{B}$ and $\mathbf{A} \mathbf{B}$ for two square matrices in $\mathbb{R}^{1024 \times 1024}$. Which one is faster?


[Discussions](https://discuss.d2l.ai/t/421)


```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, ConfusionMatrixDisplay
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
```

## Load the Dataset

```python
df1 = pd.read_csv('modis_2021_India.csv')
df2 = pd.read_csv('modis_2022_India.csv')
df3 = pd.read_csv('modis_2023_India.csv')
df1.head()  # first 5 rows; use df1.tail() for the last 5
```

| | latitude | longitude | brightness | scan | track | acq_date | acq_time | satellite | instrument | confidence | version | bright_t31 | frp | daynight | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28.0993 | 96.9983 | 303.0 | 1.1 | 1.1 | 2021-01-01 | 409 | Terra | MODIS | 44 | 6.03 | 292.6 | 8.6 | D | 0 |
| 1 | 30.0420 | 79.6492 | 301.8 | 1.4 | 1.2 | 2021-01-01 | 547 | Terra | MODIS | 37 | 6.03 | 287.4 | 9.0 | D | 0 |
| 2 | 30.0879 | 78.8579 | 300.2 | 1.3 | 1.1 | 2021-01-01 | 547 | Terra | MODIS | 8 | 6.03 | 286.5 | 5.4 | D | 0 |
| 3 | 30.0408 | 80.0501 | 302.0 | 1.5 | 1.2 | 2021-01-01 | 547 | Terra | MODIS | 46 | 6.03 | 287.7 | 10.7 | D | 0 |
| 4 | 30.6565 | 78.9668 | 300.9 | 1.3 | 1.1 | 2021-01-01 | 547 | Terra | MODIS | 43 | 6.03 | 287.6 | 9.0 | D | 0 |

```python
df2.head()
```

| | latitude | longitude | brightness | scan | track | acq_date | acq_time | satellite | instrument | confidence | version | bright_t31 | frp | daynight | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30.1138 | 80.0756 | 300.0 | 1.2 | 1.1 | 2022-01-01 | 511 | Terra | MODIS | 7 | 6.03 | 288.4 | 7.1 | D | 0 |
| 1 | 23.7726 | 86.2078 | 306.1 | 1.6 | 1.2 | 2022-01-01 | 512 | Terra | MODIS | 62 | 6.03 | 293.5 | 10.4 | D | 2 |
| 2 | 22.2080 | 84.8627 | 304.8 | 1.4 | 1.2 | 2022-01-01 | 512 | Terra | MODIS | 42 | 6.03 | 293.3 | 5.8 | D | 2 |
| 3 | 23.7621 | 86.3946 | 306.9 | 1.6 | 1.2 | 2022-01-01 | 512 | Terra | MODIS | 38 | 6.03 | 295.2 | 9.3 | D | 2 |
| 4 | 23.6787 | 86.0891 | 303.6 | 1.5 | 1.2 | 2022-01-01 | 512 | Terra | MODIS | 52 | 6.03 | 293.1 | 7.2 | D | 2 |

```python
df3.head()
```

| | latitude | longitude | brightness | scan | track | acq_date | acq_time | satellite | instrument | confidence | version | bright_t31 | frp | daynight | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9.3280 | 77.6247 | 318.0 | 1.1 | 1.0 | 2023-01-01 | 821 | Aqua | MODIS | 62 | 61.03 | 305.0 | 7.6 | D | 0 |
| 1 | 10.4797 | 77.9378 | 313.8 | 1.0 | 1.0 | 2023-01-01 | 822 | Aqua | MODIS | 58 | 61.03 | 299.4 | 4.3 | D | 0 |
| 2 | 13.2478 | 77.2639 | 314.7 | 1.0 | 1.0 | 2023-01-01 | 822 | Aqua | MODIS | 55 | 61.03 | 302.4 | 4.9 | D | 0 |
| 3 | 12.2994 | 78.4085 | 314.3 | 1.0 | 1.0 | 2023-01-01 | 822 | Aqua | MODIS | 58 | 61.03 | 301.9 | 4.8 | D | 0 |
| 4 | 14.1723 | 75.5024 | 338.4 | 1.2 | 1.1 | 2023-01-01 | 823 | Aqua | MODIS | 88 | 61.03 | 305.3 | 41.5 | D | 0 |

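```python
# Stack the three yearly frames into one and renumber the index.
df = pd.concat([df1, df2, df3], ignore_index=True)
df.head()
```

| | latitude | longitude | brightness | scan | track | acq_date | acq_time | satellite | instrument | confidence | version | bright_t31 | frp | daynight | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28.0993 | 96.9983 | 303.0 | 1.1 | 1.1 | 2021-01-01 | 409 | Terra | MODIS | 44 | 6.03 | 292.6 | 8.6 | D | 0 |
| 1 | 30.0420 | 79.6492 | 301.8 | 1.4 | 1.2 | 2021-01-01 | 547 | Terra | MODIS | 37 | 6.03 | 287.4 | 9.0 | D | 0 |
| 2 | 30.0879 | 78.8579 | 300.2 | 1.3 | 1.1 | 2021-01-01 | 547 | Terra | MODIS | 8 | 6.03 | 286.5 | 5.4 | D | 0 |
| 3 | 30.0408 | 80.0501 | 302.0 | 1.5 | 1.2 | 2021-01-01 | 547 | Terra | MODIS | 46 | 6.03 | 287.7 | 10.7 | D | 0 |
| 4 | 30.6565 | 78.9668 | 300.9 | 1.3 | 1.1 | 2021-01-01 | 547 | Terra | MODIS | 43 | 6.03 | 287.6 | 9.0 | D | 0 |

```python
df.shape  # (rows, columns)
```

```
(271217, 15)
```
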
```python
df.info()  # column dtypes and memory usage
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271217 entries, 0 to 271216
Data columns (total 15 columns):
 #   Column      Non-Null Count   Dtype
---  ------      --------------   -----
 0   latitude    271217 non-null  float64
 1   longitude   271217 non-null  float64
 2   brightness  271217 non-null  float64
 3   scan        271217 non-null  float64
 4   track       271217 non-null  float64
 5   acq_date    271217 non-null  object
 6   acq_time    271217 non-null  int64
 7   satellite   271217 non-null  object
 8   instrument  271217 non-null  object
 9   confidence  271217 non-null  int64
 10  version     271217 non-null  float64
 11  bright_t31  271217 non-null  float64
 12  frp         271217 non-null  float64
 13  daynight    271217 non-null  object
 14  type        271217 non-null  int64
dtypes: float64(8), int64(3), object(4)
memory usage: 31.0+ MB
```

```python
# Any missing values?
df.isnull().sum()
```

```
latitude      0
longitude     0
brightness    0
scan          0
track         0
acq_date      0
acq_time      0
satellite     0
instrument    0
confidence    0
version       0
bright_t31    0
frp           0
daynight      0
type          0
dtype: int64
```

```python
# Number of fully duplicated rows
df.duplicated().sum()
```

```
0
```

```python
# List the column names to check them
df.columns
```

```
Index(['latitude', 'longitude', 'brightness', 'scan', 'track', 'acq_date',
       'acq_time', 'satellite', 'instrument', 'confidence', 'version',
       'bright_t31', 'frp', 'daynight', 'type'],
      dtype='object')
```

```python
df.describe().T  # summary statistics of the numeric columns
```

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| latitude | 271217.0 | 23.947505 | 4.919846 | 8.1362 | 20.9655 | 23.7888 | 27.7827 | 34.9734 |
| longitude | 271217.0 | 81.284024 | 6.559071 | 68.4526 | 75.8802 | 79.3209 | 84.7559 | 97.1044 |
| brightness | 271217.0 | 323.719192 | 14.147221 | 300.0000 | 314.5000 | 322.0000 | 330.7000 | 505.7000 |
| scan | 271217.0 | 1.421732 | 0.630742 | 1.0000 | 1.0000 | 1.2000 | 1.5000 | 4.8000 |
| track | 271217.0 | 1.152716 | 0.201943 | 1.0000 | 1.0000 | 1.1000 | 1.2000 | 2.0000 |
| acq_time | 271217.0 | 824.623755 | 353.966965 | 321.0000 | 648.0000 | 756.0000 | 825.0000 | 2202.0000 |
| confidence | 271217.0 | 64.065081 | 18.165329 | 0.0000 | 54.0000 | 66.0000 | 76.0000 | 100.0000 |
| version | 271217.0 | 21.933778 | 24.935515 | 6.0300 | 6.0300 | 6.0300 | 61.0300 | 61.0300 |
| bright_t31 | 271217.0 | 303.499177 | 8.282440 | 267.2000 | 298.2000 | 302.5000 | 309.2000 | 400.1000 |
| frp | 271217.0 | 27.722058 | 81.017471 | 0.0000 | 8.7000 | 13.5000 | 24.5000 | 6961.8000 |
| type | 271217.0 | 0.100385 | 0.437215 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 3.0000 |

```python
# Check the unique values of the target variable
df.type.value_counts()
```

```
0    257625
2     13550
3        42
Name: type, dtype: int64
```

## Exploratory Data Analysis (EDA)

```python
# Print the unique values and their count for every categorical feature
for col in df.columns:
    if df[col].dtype == 'object':
        print(f"Column: {col}")
        print(f"Unique values: {df[col].unique()}")
        print(f"Number of unique values: {df[col].nunique()}")
        print("-" * 50)
```

```
Column: acq_date
Unique values: ['2021-01-01' '2021-01-02' '2021-01-03' ... '2023-12-29' '2023-12-30'
 '2023-12-31']
Number of unique values: 1088
--------------------------------------------------
Column: satellite
Unique values: ['Terra' 'Aqua']
Number of unique values: 2
--------------------------------------------------
Column: instrument
Unique values: ['MODIS']
Number of unique values: 1
--------------------------------------------------
Column: daynight
Unique values: ['D' 'N']
Number of unique values: 2
--------------------------------------------------
```

```python
# Count plot for 'type'
plt.figure(figsize=(8, 6))
sns.countplot(x='type', data=df)
plt.title('Distribution of Fire Types')
plt.xlabel('Fire Type')
plt.ylabel('Count')
plt.show()
```

The count plot shows the distribution of the `type` values in the dataset.
Type 0 is by far the most frequent (257,625 rows), followed by type 2 (13,550 rows) and type 3 (42 rows).
The `type` variable is therefore heavily imbalanced, with type 0 having far more observations than the other two classes combined. This imbalance might need to be considered during model training.
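
One common way to account for such an imbalance is to weight each class inversely to its frequency. The sketch below is only an illustration of that idea, not the method used in this notebook: it takes the class counts from the `df.type.value_counts()` output and applies the formula `n_samples / (n_classes * count)`, which is the heuristic scikit-learn uses for `class_weight='balanced'`.

```python
# Class counts copied from the `df.type.value_counts()` output.
counts = {0: 257625, 2: 13550, 3: 42}

n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: rarer classes receive proportionally larger weights.
weights = {label: n_samples / (n_classes * count)
           for label, count in counts.items()}

for label, weight in weights.items():
    share = counts[label] / n_samples
    print(f"type {label}: share {share:.4%}, weight {weight:.3f}")
```

A dictionary like `weights` can be passed as `class_weight` to estimators such as `LogisticRegression` or `RandomForestClassifier`; whether this helps here would need to be checked on a held-out split.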

```python
# Histogram of 'confidence'
plt.figure(figsize=(8, 6))
sns.histplot(df['confidence'], bins=20, kde=True)
plt.title('Distribution of Confidence')
plt.xlabel('Confidence')
plt.ylabel('Frequency')
plt.show()
```

