
# Using Jupyter Notebooks
:label:`sec_jupyter`


This section describes how to edit and run the code
in each section of this book
using the Jupyter Notebook. Make sure you have
installed Jupyter and downloaded the
code as described in
:ref:`chap_installation`.
If you want to know more about Jupyter, see the excellent tutorial in
the Jupyter [documentation](https://jupyter.readthedocs.io/en/latest/).


## Editing and Running the Code Locally

Suppose that the local path of the book's code is `xx/yy/d2l-en/`. Use the shell to change the directory to this path (`cd xx/yy/d2l-en`) and run the command `jupyter notebook`. If your browser does not open the page automatically, open http://localhost:8888 and you will see the Jupyter interface and all the folders containing the code of the book, as shown in :numref:`fig_jupyter00`.

![The folders containing the code of this book.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter00.png?raw=1)
:width:`600px`
:label:`fig_jupyter00`


You can access the notebook files by clicking on the folder displayed on the webpage.
They usually have the suffix ".ipynb".
For the sake of brevity, we create a temporary "test.ipynb" file.
The content displayed after you click it is
shown in :numref:`fig_jupyter01`.
This notebook includes a markdown cell and a code cell. The content in the markdown cell includes "This Is a Title" and "This is text.".
The code cell contains two lines of Python code.

![Markdown and code cells in the "test.ipynb" file.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter01.png?raw=1)
:width:`600px`
:label:`fig_jupyter01`


Double click on the markdown cell to enter edit mode.
Add a new text string "Hello world." at the end of the cell, as shown in :numref:`fig_jupyter02`.

![Edit the markdown cell.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter02.png?raw=1)
:width:`600px`
:label:`fig_jupyter02`


As demonstrated in :numref:`fig_jupyter03`,
click "Cell" $\rightarrow$ "Run Cells" in the menu bar to run the edited cell.

![Run the cell.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter03.png?raw=1)
:width:`600px`
:label:`fig_jupyter03`

After running, the markdown cell is shown in :numref:`fig_jupyter04`.

![The markdown cell after running.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter04.png?raw=1)
:width:`600px`
:label:`fig_jupyter04`


Next, click on the code cell. After the last line of code, add a line that multiplies the elements by 2, as shown in :numref:`fig_jupyter05`.

![Edit the code cell.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter05.png?raw=1)
:width:`600px`
:label:`fig_jupyter05`


You can also run the cell with a shortcut ("Ctrl + Enter" by default) and obtain the output result from :numref:`fig_jupyter06`.

![Run the code cell to obtain the output.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter06.png?raw=1)
:width:`600px`
:label:`fig_jupyter06`


When a notebook contains more cells, we can click "Kernel" $\rightarrow$ "Restart & Run All" in the menu bar to run all the cells in the entire notebook. By clicking "Help" $\rightarrow$ "Edit Keyboard Shortcuts" in the menu bar, you can edit the shortcuts according to your preferences.

## Advanced Options

Beyond local editing, two things are quite important: editing the notebooks in the markdown format and running Jupyter remotely.
The latter matters when we want to run the code on a faster server.
The former matters since Jupyter's native ipynb format stores a lot of auxiliary data that is
irrelevant to the content,
mostly related to how and where the code is run.
This is confusing for Git, making
reviewing contributions very difficult.
Fortunately there is an alternative---native editing in the markdown format.

### Markdown Files in Jupyter

If you wish to contribute to the content of this book, you need to modify the
source file (md file, not ipynb file) on GitHub.
Using the notedown plugin we
can modify notebooks in the md format directly in Jupyter.


First, install the notedown plugin, run the Jupyter Notebook, and load the plugin:

```
pip install d2l-notedown  # You may need to uninstall the original notedown.
jupyter notebook --NotebookApp.contents_manager_class='notedown.NotedownContentsManager'
```

You may also turn on the notedown plugin by default whenever you run the Jupyter Notebook.
First, generate a Jupyter Notebook configuration file (if it has already been generated, you can skip this step).

```
jupyter notebook --generate-config
```

Then, add the following line to the end of the Jupyter Notebook configuration file (for Linux or macOS, usually in the path `~/.jupyter/jupyter_notebook_config.py`):

```
c.NotebookApp.contents_manager_class = 'notedown.NotedownContentsManager'
```

After that, you only need to run the `jupyter notebook` command to turn on the notedown plugin by default.

### Running Jupyter Notebooks on a Remote Server

Sometimes, you may want to run Jupyter notebooks on a remote server and access it through a browser on your local computer. If Linux or macOS is installed on your local machine (Windows can also support this function through third-party software such as PuTTY), you can use port forwarding:

```
ssh myserver -L 8888:localhost:8888
```

The above string `myserver` is the address of the remote server.
Then we can use http://localhost:8888 to access the remote server `myserver` that runs Jupyter notebooks. We will detail how to run Jupyter notebooks on AWS instances
later in this appendix.

### Timing

We can use the `ExecuteTime` plugin to time the execution of each code cell in Jupyter notebooks.
Use the following commands to install the plugin:

```
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
jupyter nbextension enable execute_time/ExecuteTime
```
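
If installing an extension is not an option, a block of code can also be timed from inside a cell. The following is a minimal sketch using only the Python standard library. The helper name `timed` and the `timings` dictionary are illustrative assumptions and are not part of Jupyter or the `ExecuteTime` plugin.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Time the enclosed block with a monotonic high-resolution clock."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        # Optionally store the measurement for later comparison.
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.6f} s")


timings = {}  # hypothetical container for collected measurements
with timed("sum of squares", timings):
    total = sum(i * i for i in range(100_000))
```

Inside a notebook, the built-in IPython magics `%time` and `%timeit` serve a similar purpose for a single statement.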

## Summary

* Using the Jupyter Notebook tool, we can edit, run, and contribute to each section of the book.
* We can run Jupyter notebooks on remote servers using port forwarding.


## Exercises

1. Edit and run the code in this book with the Jupyter Notebook on your local machine.
1. Edit and run the code in this book with the Jupyter Notebook *remotely* via port forwarding.
1. Compare the running time of the operations $\mathbf{A}^\top \mathbf{B}$ and $\mathbf{A} \mathbf{B}$ for two square matrices in $\mathbb{R}^{1024 \times 1024}$. Which one is faster?


[Discussions](https://discuss.d2l.ai/t/421)


## Import Libraries
%pip install numpy pandas matplotlib seaborn scikit-learn scipy xgboost folium
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, ConfusionMatrixDisplay
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
## Load the Dataset
df1 = pd.read_csv('modis_2021_India.csv')
df2 = pd.read_csv('modis_2022_India.csv')
df3 = pd.read_csv('modis_2023_India.csv')
df1.head()  # show the first 5 rows; use df1.tail() for the last 5
latitude	longitude	brightness	scan	track	acq_date	acq_time	satellite	instrument	confidence	version	bright_t31	frp	daynight	type
0	28.0993	96.9983	303.0	1.1	1.1	2021-01-01	409	Terra	MODIS	44	6.03	292.6	8.6	D	0
1	30.0420	79.6492	301.8	1.4	1.2	2021-01-01	547	Terra	MODIS	37	6.03	287.4	9.0	D	0
2	30.0879	78.8579	300.2	1.3	1.1	2021-01-01	547	Terra	MODIS	8	6.03	286.5	5.4	D	0
3	30.0408	80.0501	302.0	1.5	1.2	2021-01-01	547	Terra	MODIS	46	6.03	287.7	10.7	D	0
4	30.6565	78.9668	300.9	1.3	1.1	2021-01-01	547	Terra	MODIS	43	6.03	287.6	9.0	D	0
df2.head()
latitude	longitude	brightness	scan	track	acq_date	acq_time	satellite	instrument	confidence	version	bright_t31	frp	daynight	type
0	30.1138	80.0756	300.0	1.2	1.1	2022-01-01	511	Terra	MODIS	7	6.03	288.4	7.1	D	0
1	23.7726	86.2078	306.1	1.6	1.2	2022-01-01	512	Terra	MODIS	62	6.03	293.5	10.4	D	2
2	22.2080	84.8627	304.8	1.4	1.2	2022-01-01	512	Terra	MODIS	42	6.03	293.3	5.8	D	2
3	23.7621	86.3946	306.9	1.6	1.2	2022-01-01	512	Terra	MODIS	38	6.03	295.2	9.3	D	2
4	23.6787	86.0891	303.6	1.5	1.2	2022-01-01	512	Terra	MODIS	52	6.03	293.1	7.2	D	2
df3.head()
latitude	longitude	brightness	scan	track	acq_date	acq_time	satellite	instrument	confidence	version	bright_t31	frp	daynight	type
0	9.3280	77.6247	318.0	1.1	1.0	2023-01-01	821	Aqua	MODIS	62	61.03	305.0	7.6	D	0
1	10.4797	77.9378	313.8	1.0	1.0	2023-01-01	822	Aqua	MODIS	58	61.03	299.4	4.3	D	0
2	13.2478	77.2639	314.7	1.0	1.0	2023-01-01	822	Aqua	MODIS	55	61.03	302.4	4.9	D	0
3	12.2994	78.4085	314.3	1.0	1.0	2023-01-01	822	Aqua	MODIS	58	61.03	301.9	4.8	D	0
4	14.1723	75.5024	338.4	1.2	1.1	2023-01-01	823	Aqua	MODIS	88	61.03
df = pd.concat([df1, df2, df3], ignore_index=True)  # combine the three yearly files
df.head()
latitude	longitude	brightness	scan	track	acq_date	acq_time	satellite	instrument	confidence	version	bright_t31	frp	daynight	type
0	28.0993	96.9983	303.0	1.1	1.1	2021-01-01	409	Terra	MODIS	44	6.03	292.6	8.6	D	0
1	30.0420	79.6492	301.8	1.4	1.2	2021-01-01	547	Terra	MODIS	37	6.03	287.4	9.0	D	0
2	30.0879	78.8579	300.2	1.3	1.1	2021-01-01	547	Terra	MODIS	8	6.03	286.5	5.4	D	0
3	30.0408	80.0501	302.0	1.5	1.2	2021-01-01	547	Terra	MODIS	46	6.03	287.7	10.7	D	0
4	30.6565	78.9668	300.9	1.3	1.1	2021-01-01	547	Terra	MODIS	43	6.03	287.6	9.0	D	0
df.shape # rows and cols
(271217, 15)
df.info()  # column dtypes, non-null counts, and memory usage
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271217 entries, 0 to 271216
Data columns (total 15 columns):
 #   Column      Non-Null Count   Dtype
---  ------      --------------   -----
 0   latitude    271217 non-null  float64
 1   longitude   271217 non-null  float64
 2   brightness  271217 non-null  float64
 3   scan        271217 non-null  float64
 4   track       271217 non-null  float64
 5   acq_date    271217 non-null  object
 6   acq_time    271217 non-null  int64
 7   satellite   271217 non-null  object
 8   instrument  271217 non-null  object
 9   confidence  271217 non-null  int64
 10  version     271217 non-null  float64
 11  bright_t31  271217 non-null  float64
 12  frp         271217 non-null  float64
 13  daynight    271217 non-null  object
 14  type        271217 non-null  int64
dtypes: float64(8), int64(3), object(4)
memory usage: 31.0+ MB
# Any missing values?
df.isnull().sum()
latitude      0
longitude     0
brightness    0
scan          0
track         0
acq_date      0
acq_time      0
satellite     0
instrument    0
confidence    0
version       0
bright_t31    0
frp           0
daynight      0
type          0
dtype: int64
df.duplicated().sum()
0
# List out column names to check
df.columns
Index(['latitude', 'longitude', 'brightness', 'scan', 'track', 'acq_date',
       'acq_time', 'satellite', 'instrument', 'confidence', 'version',
       'bright_t31', 'frp', 'daynight', 'type'],
      dtype='object')
df.describe().T # statistics of dataset - numbers!
count	mean	std	min	25%	50%	75%	max
latitude	271217.0	23.947505	4.919846	8.1362	20.9655	23.7888	27.7827	34.9734
longitude	271217.0	81.284024	6.559071	68.4526	75.8802	79.3209	84.7559	97.1044
brightness	271217.0	323.719192	14.147221	300.0000	314.5000	322.0000	330.7000	505.7000
scan	271217.0	1.421732	0.630742	1.0000	1.0000	1.2000	1.5000	4.8000
track	271217.0	1.152716	0.201943	1.0000	1.0000	1.1000	1.2000	2.0000
acq_time	271217.0	824.623755	353.966965	321.0000	648.0000	756.0000	825.0000	2202.0000
confidence	271217.0	64.065081	18.165329	0.0000	54.0000	66.0000	76.0000	100.0000
version	271217.0	21.933778	24.935515	6.0300	6.0300	6.0300	61.0300	61.0300
bright_t31	271217.0	303.499177	8.282440	267.2000	298.2000	302.5000	309.2000	400.1000
frp	271217.0	27.722058	81.017471	0.0000	8.7000	13.5000	24.5000	6961.8000
type	271217.0	0.100385	0.437215	0.0000	0.0000	0.0000	0.0000	3.0000
# Check Unique values of target variable
df.type.value_counts()
0    257625
2     13550
3        42
Name: type, dtype: int64
## Exploratory Data Analysis (EDA)
# Check unique and n unique for all categorical features
for col in df.columns:
  if df[col].dtype == 'object':
    print(f"Column: {col}")
    print(f"Unique values: {df[col].unique()}")
    print(f"Number of unique values: {df[col].nunique()}")
    print("-" * 50)
Column: acq_date
Unique values: ['2021-01-01' '2021-01-02' '2021-01-03' ... '2023-12-29' '2023-12-30'
 '2023-12-31']
Number of unique values: 1088
--------------------------------------------------
Column: satellite
Unique values: ['Terra' 'Aqua']
Number of unique values: 2
--------------------------------------------------
Column: instrument
Unique values: ['MODIS']
Number of unique values: 1
--------------------------------------------------
Column: daynight
Unique values: ['D' 'N']
Number of unique values: 2
--------------------------------------------------
# Count plot for 'type'
plt.figure(figsize=(8, 6))
sns.countplot(x='type', data=df)
plt.title('Distribution of Fire Types')
plt.xlabel('Fire Type')
plt.ylabel('Count')
plt.show()

The count plot shows the distribution of the fire types in the dataset.
Type 0 is by far the most frequent (257,625 observations), followed by type 2 (13,550) and type 3 (42).
The 'type' variable is therefore heavily imbalanced, which needs to be considered during model training.
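
The imbalance can be quantified directly from the class counts. The sketch below uses the counts reported by `df.type.value_counts()` above and derives inverse-frequency class weights. The weighting formula is the common "balanced" heuristic and is an assumption here, not something the notebook itself applies.

```python
# Class counts copied from the df.type.value_counts() output above.
class_counts = {0: 257625, 2: 13550, 3: 42}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# Share of each class in the full dataset.
shares = {label: count / n_samples for label, count in class_counts.items()}

# Inverse-frequency weights: n_samples / (n_classes * count).
# This is the heuristic scikit-learn applies for class_weight="balanced".
weights = {label: n_samples / (n_classes * count)
           for label, count in class_counts.items()}

for label in class_counts:
    print(f"type {label}: share={shares[label]:.4%}, weight={weights[label]:.2f}")
```

Weights of this kind could be passed to classifiers that accept a `class_weight` argument, such as `LogisticRegression` or `RandomForestClassifier`.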
# Histogram of 'confidence'
plt.figure(figsize=(8, 6))
sns.histplot(df['confidence'], bins=20, kde=True)
plt.title('Distribution of Confidence')
plt.xlabel('Confidence')
plt.ylabel('Frequency')
plt.show()

The histogram illustrates the distribution of the 'confidence' feature.
The distribution appears to be bimodal, with peaks around low confidence values and high confidence values.
There are fewer observations in the middle range of confidence.
This suggests that observations are often recorded with either low confidence or high confidence.
# Box plot for 'confidence' by 'type'
plt.figure(figsize=(8, 6))
sns.boxplot(x='type', y='confidence', data=df)
plt.title('Confidence by Fire Type')
plt.xlabel('Fire Type')
plt.ylabel('Confidence')
plt.show()

The box plot shows the distribution of 'confidence' for each fire type.
Both type 0 and type 2 have a wide range of confidence values.
The median confidence for both types appears to be in the higher range.
There are some outliers, particularly for the most frequent type (type 0), indicating observations with unusually low or high confidence.
# Scatter plot of 'latitude' vs 'longitude'
plt.figure(figsize=(10, 8))
sns.scatterplot(x='longitude', y='latitude', data=df, hue='type', s=10)
plt.title('Fire Locations by Type')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.legend(title='Fire Type')
plt.show()

The scatter plot visualizes the geographical distribution of fire locations, colored by fire type.
It provides a visual representation of where fires are occurring based on latitude and longitude.
Different fire types might be concentrated in specific geographical areas, which could be a useful feature for modeling.
The density of points indicates areas with higher fire activity.
# Count plot for 'daynight'
plt.figure(figsize=(6, 4))
sns.countplot(x='daynight', data=df)
plt.title('Distribution of Day/Night Observations')
plt.xlabel('Day/Night')
plt.ylabel('Count')
plt.show()

The count plot for 'daynight' shows whether the fire observations were made during the day or night.
It indicates the proportion of day versus night observations in the dataset.
Knowing the distribution of day/night observations can be relevant as detection capabilities or fire behavior might differ between day and night.
# Count plot for 'Satellite'
plt.figure(figsize=(6, 4))
sns.countplot(x='satellite', data=df)
plt.title('Distribution of Satellite Observations')
plt.xlabel('Satellite')
plt.ylabel('Count')
plt.show()

This count plot shows the distribution of observations made by different satellites.
It reveals which satellites contributed the most data to the dataset.
Understanding the satellite distribution can be important as different satellites may have different characteristics or coverage.
# Count plot for 'version'
plt.figure(figsize=(6, 4))
sns.countplot(x='version', data=df)
plt.title('Distribution of Version')
plt.xlabel('Version')
plt.ylabel('Count')
plt.show()

# The pairplot below is slow on the full dataset, so it is commented out
#Pairplot for numerical features (subset)
#sns.pairplot(df[['latitude', 'longitude', 'brightness', 'confidence', 'frp', 'type']], hue='type', diag_kind='kde')
#plt.suptitle('Pairplot of Numerical Features')
#plt.show()
The pairplot provides a matrix of scatter plots for all pairs of numerical features and histograms/KDE plots on the diagonal for each feature, separated by the 'type' variable. Here are some insights from the pairplot:

Individual Feature Distributions (Diagonal): The diagonal plots (histograms/KDEs) show the distribution of each numerical feature for each fire type.

latitude and longitude: These show the geographical distribution, reinforcing the scatter plot observation. Different fire types appear to be concentrated in certain geographical areas.
brightness: The distribution of brightness values can be compared between fire types. There might be differences in the typical brightness of the different types.
confidence: This shows the distribution of confidence for each type, similar to the earlier box plot but as a histogram/KDE. It can highlight differences in the confidence levels associated with each fire type.
frp: The distribution of fire radiative power (FRP) can be compared. This might reveal if one fire type tends to have significantly higher or lower FRP values than the other.
Relationships Between Features (Off-Diagonal Scatter Plots): The off-diagonal scatter plots show the relationship between pairs of numerical features, colored by fire type.

latitude vs. longitude: As seen before, this visualizes the geographical distribution by type.
brightness vs. confidence: This plot shows the relationship between brightness and confidence. Is there a correlation? Does higher brightness tend to correlate with higher confidence? How does this relationship differ between fire types?
brightness vs. frp: This shows the relationship between brightness and fire radiative power. These two features are likely related. The plot can reveal the strength and nature of this relationship and whether it varies by fire type.
confidence vs. frp: This visualizes the relationship between confidence and FRP. Does higher FRP tend to result in higher confidence? How does this relationship differ for different fire types?
Other pairs: Examine the relationships between latitude/longitude and the other numerical features (brightness, confidence, frp). Are there geographical patterns in these features?
# Heatmap of correlations between numerical features
plt.figure(figsize=(10, 8))
correlation_matrix = df[['latitude', 'longitude', 'brightness', 'confidence', 'frp']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap of Numerical Features')
plt.show()

The heatmap visualizes the Pearson correlation coefficients between the numerical features: 'latitude', 'longitude', 'brightness', 'confidence', and 'frp'. The values range from -1 to 1, where:

1 indicates a perfect positive linear correlation.
-1 indicates a perfect negative linear correlation.
0 indicates no linear correlation.
The color intensity and the annotation (annot=True) help in quickly identifying the strength and direction of the relationships.
Key insights from the heatmap:

High Correlation between brightness and frp: There appears to be a strong positive correlation between 'brightness' and 'frp'. This is expected as both features are related to the intensity of the fire. Higher brightness is likely to be associated with higher fire radiative power. This strong correlation might indicate multicollinearity if both features are used directly in a linear model, but can also be insightful for understanding the data.

Moderate Correlation between brightness and confidence: There seems to be a moderate positive correlation between 'brightness' and 'confidence'. This suggests that brighter fire detections tend to be associated with higher confidence levels.

Moderate Correlation between frp and confidence: Similarly, there is likely a moderate positive correlation between 'frp' and 'confidence'. Fires with higher radiative power might be easier to detect and thus have higher confidence scores.

Low Correlation with Geographical Features: The correlations between 'latitude' and 'longitude' with 'brightness', 'confidence', and 'frp' appear to be relatively low. This suggests that the intensity or confidence of a fire detection is not strongly linearly related to its geographical location. While there might be spatial patterns as seen in the scatter plot, a simple linear correlation doesn't capture them strongly.

Correlation between latitude and longitude: The correlation between 'latitude' and 'longitude' is often low unless there's a specific geographical pattern in the data that aligns linearly. In this case, it's likely low, indicating that fires are distributed across various locations without a strong linear relationship between their latitude and longitude coordinates within the dataset.

Overall, the heatmap provides a concise overview of the linear relationships between the numerical features. It highlights the expected strong correlations between features related to fire intensity (brightness, frp, confidence) and shows that geographical coordinates have weaker linear relationships with these intensity measures. This information can be valuable for feature selection, understanding feature interactions, and guiding the choice of modeling techniques.
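
To make the coefficient shown in the heatmap concrete, here is a minimal sketch that computes the Pearson correlation by hand. The helper name `pearson_r` and the toy values are illustrative assumptions and are not taken from the fire dataset. The sketch also assumes that neither input is constant, since a zero variance would cause a division by zero.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy values, not taken from the fire dataset.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(pearson_r(x, [2.0, 4.0, 6.0, 8.0, 10.0]))  # perfectly increasing
print(pearson_r(x, [10.0, 8.0, 6.0, 4.0, 2.0]))  # perfectly decreasing
```

The first call yields a value at (or numerically very close to) 1 and the second a value at -1, matching the two extremes described above.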

numerical_cols = df.select_dtypes(include=np.number).columns
numerical_cols
Index(['latitude', 'longitude', 'brightness', 'scan', 'track', 'acq_time',
       'confidence', 'version', 'bright_t31', 'frp', 'type'],
      dtype='object')
numerical_cols = ['brightness', 'scan', 'track', 'acq_time','confidence', 'version', 'bright_t31', 'frp']
df[numerical_cols].hist(bins=50, figsize=(15, 10))
plt.suptitle('Histograms of Numerical Features')
plt.show()

'brightness': The distribution of brightness values. This shows the range of detected fire brightness and where the values tend to cluster. It might reveal if fires tend to be of low, medium, or high brightness.
'scan': The distribution of scan sizes. This feature relates to the size of the pixel footprint. The histogram shows the typical scan sizes in the dataset.
'track': Similar to scan, this relates to the track size. The histogram shows the distribution of track sizes.
'acq_time': The distribution of acquisition times (likely represented as a numerical value like time of day). This histogram can reveal patterns in when fires are detected (e.g., more detections during certain hours).
'confidence': The distribution of confidence scores. This is a numerical representation of the earlier confidence histogram and box plot. It reinforces the bimodal nature observed earlier.
'version': The distribution of different version values. This shows the frequency of observations from different processing versions.
'bright_t31': The distribution of brightness temperature at band 31. This is another measure related to fire intensity. Its distribution can be compared to 'brightness'.
'frp': The distribution of fire radiative power. This shows the typical FRP values in the dataset and their range. It complements the 'brightness' histogram in understanding fire intensity.
'type': Although 'type' is stored as a number, it is not in the `numerical_cols` list used above, so no histogram is drawn for it here. Its class imbalance is visible in the earlier count plot.
Overall, these histograms provide a detailed look at the individual distributions of the numerical features. They help in understanding the range, central tendency, and variability of each feature, identifying potential outliers, and assessing the shape of the distribution (e.g., normal, skewed, bimodal). This information is crucial for data preprocessing, feature understanding, and selecting appropriate modeling techniques.

import scipy.stats as stats

# List of numerical features to check for distribution
numerical_features = ['brightness', 'confidence', 'frp', 'bright_t31', 'scan', 'track']

for feature in numerical_features:
    print(f"Analyzing distribution for: {feature}")

    # KDE Plot
    plt.figure(figsize=(12, 5))

    plt.subplot(1, 2, 1)
    sns.kdeplot(df[feature], fill=True)
    plt.title(f'KDE Plot of {feature}')
    plt.xlabel(feature)
    plt.ylabel('Density')

    # QQ Plot
    plt.subplot(1, 2, 2)
    stats.probplot(df[feature], dist="norm", plot=plt)
    plt.title(f'QQ Plot of {feature}')

    plt.tight_layout()
    plt.show()
    print("-" * 50)
Analyzing distribution for: brightness

--------------------------------------------------
Analyzing distribution for: confidence

--------------------------------------------------
Analyzing distribution for: frp

--------------------------------------------------
Analyzing distribution for: bright_t31

--------------------------------------------------
Analyzing distribution for: scan

--------------------------------------------------
Analyzing distribution for: track

--------------------------------------------------
'brightness': Distribution is skewed and bimodal, QQ plot shows significant deviation from normality.

'confidence': Distribution is bimodal with peaks at low and high values, QQ plot confirms non-normality, especially in the tails.

'frp': Distribution is highly skewed to the right, QQ plot shows a strong departure from the normal distribution, particularly for larger values.

'bright_t31': Distribution appears somewhat skewed, QQ plot indicates deviation from normality, especially at the extremes.

'scan': Distribution is concentrated at lower values with a tail towards higher values, QQ plot suggests non-normality.

'track': Distribution is concentrated at lower values with a tail towards higher values, QQ plot suggests non-normality.
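
The visual impression of skew can be backed by a number. Below is a minimal sketch of the sample skewness (the biased Fisher-Pearson coefficient, which is also what `scipy.stats.skew` computes with its default settings). The helper name and the toy lists are illustrative assumptions, not values from the dataset.

```python
def sample_skewness(values):
    """Fisher-Pearson skewness g1 = m3 / m2**1.5 (biased form)."""
    n = len(values)
    mean = sum(values) / n
    # Second and third central moments.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]          # toy data, symmetric around 3
right_skewed = [1, 1, 2, 2, 3, 50]   # toy data with a long right tail

print(sample_skewness(symmetric))
print(sample_skewness(right_skewed))
```

A value near zero indicates a roughly symmetric distribution, while a clearly positive value indicates a right tail of the kind described for 'frp'.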

# --- Temporal Analysis ---
# Convert 'acq_date' to datetime objects
df['acq_date'] = pd.to_datetime(df['acq_date'])
# Extract temporal features
df['year'] = df['acq_date'].dt.year
df['month'] = df['acq_date'].dt.month
df['day_of_week'] = df['acq_date'].dt.dayofweek # Monday=0, Sunday=6
df['day_of_year'] = df['acq_date'].dt.dayofyear
df['hour'] = df['acq_time'].astype(str).str[:2].astype(int) # Assuming acq_time is HHMM
Extracting Temporal Features: It converts the acq_date column to datetime objects and extracts new features like year, month, day_of_week, day_of_year, and hour from the acquisition date and time.
Visualizing Temporal Distributions: It generates count plots to show:
The number of fire detections per month.
The number of fire detections per day of the week.
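
One caveat about the hour extraction: `acq_time` is stored as an integer, so a time such as 04:09 appears as 409, and taking the first two characters of the string gives 40 rather than 4. The sketch below shows two ways to recover the hour under the assumption that the column encodes HHMM. The helper names are hypothetical.

```python
def hhmm_to_hour(acq_time):
    """Return the hour from an integer time encoded as HHMM (e.g. 409 -> 4)."""
    return int(acq_time) // 100


def hhmm_to_hour_by_slicing(acq_time):
    """Same result via strings: pad to four digits before slicing."""
    return int(str(int(acq_time)).zfill(4)[:2])


for t in (409, 547, 2202):
    naive = int(str(t)[:2])  # slicing without zero-padding
    print(t, naive, hhmm_to_hour(t), hhmm_to_hour_by_slicing(t))
# prints:
# 409 40 4 4
# 547 54 5 5
# 2202 22 22 22
```

On the DataFrame, the equivalent of the first helper would be `df['acq_time'] // 100`.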
# Visualize fire detections over months
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='month', hue='month', palette='viridis', legend=False)
plt.title('Fire Detections by Month (2021-2023)')
plt.xlabel('Month')
plt.ylabel('Number of Detections')
plt.xticks(ticks=range(12), labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.show()

# Visualize fire detections by day of the week
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='day_of_week', hue='day_of_week', palette='viridis', legend=False)
plt.title('Fire Detections by Day of Week (2021-2023)')
plt.xlabel('Day of Week')
plt.ylabel('Number of Detections')
plt.xticks(ticks=range(7), labels=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
plt.show()

## Outliers and Outlier Treatments
Outliers: Outliers are data points that are significantly different from other observations in a dataset. They can occur due to measurement errors, data entry mistakes, or genuinely rare events. Outliers can skew statistical analyses (like mean, standard deviation) and impact the performance of machine learning models.
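
A common way to flag outliers is the interquartile range (IQR) rule: values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are treated as outliers. The following is a minimal sketch on toy data using only the standard library. The helper name `iqr_bounds` is hypothetical, and the `"inclusive"` quantile method is chosen on the assumption that it corresponds to the linear interpolation pandas uses by default.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # method="inclusive" interpolates linearly between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


data = [1, 2, 3, 4, 100]  # toy data with one extreme value
lower, upper = iqr_bounds(data)
kept = [v for v in data if lower <= v <= upper]
print(lower, upper, kept)
# prints: -1.0 7.0 [1, 2, 3, 4]
```

Here Q1 is 2 and Q3 is 4, so the fences are -1 and 7 and the value 100 is dropped.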

# Visualize outliers using box plots for key numerical features
plt.figure(figsize=(12, 8))
sns.boxplot(data=df[numerical_cols])
plt.title('Box Plots for Key Numerical Features')
plt.ylabel('Value')
plt.show()
'brightness', 'bright_t31', 'frp': These fire intensity-related features show a wide range and numerous high-value outliers, suggesting that while most fires might have moderate intensity, there are instances of very bright or high-FRP fires. The lower whiskers might also show some outliers on the lower end.

'scan', 'track': These features related to pixel size also show outliers, indicating observations where the scan/track size was significantly different from the typical values.

'confidence': The box plot for confidence, similar to the histogram, likely reinforces the concentration of data at the ends (low and high confidence), with some outliers in the middle range or beyond.

'acq_time': Depending on how 'acq_time' is represented numerically, the box plot could show if there are acquisition times that are significantly different from the usual patterns.

'version': This feature is represented numerically but is essentially categorical, so its box plot is less informative than a count plot. 'type' is not part of `numerical_cols` and therefore does not appear in this figure.

def remove_outliers_iqr(df, column):
  Q1 = df[column].quantile(0.25)
  Q3 = df[column].quantile(0.75)
  IQR = Q3 - Q1
  lower_bound = Q1 - 1.5 * IQR
  upper_bound = Q3 + 1.5 * IQR
  df_cleaned = df[(df[column] >= lower_bound) & (df[column] <= upper_bound)].copy()
  return df_cleaned

# Apply outlier removal to numerical columns
for col in numerical_cols:
  df = remove_outliers_iqr(df, col)

print("Shape after removing outliers:", df.shape)
Shape after removing outliers: (189370, 20)
# Visualize box plots after outlier removal
plt.figure(figsize=(12, 8))
sns.boxplot(data=df[numerical_cols])
plt.title('Box Plots for Numerical Features After Outlier Removal')
plt.ylabel('Value')
plt.show()

Box Plots (After):

The individual outlier points above and below the whiskers in the previous box plots have been significantly reduced or eliminated for the treated columns, which are all the columns in `numerical_cols` ('brightness', 'scan', 'track', 'acq_time', 'confidence', 'version', 'bright_t31', 'frp').
The maximum and minimum values represented by the upper and lower whiskers will be much closer to the bulk of the data, as extreme values have been removed.
The scale of the y-axis in the box plots for the treated features is likely smaller, as it now focuses on the data within the calculated IQR range.
The boxes (IQR) and whiskers now represent the distribution of the majority of the cleaned data. While the IQR method removes values more than 1.5 × IQR beyond the quartiles, some data points beyond the whiskers might still be present, because the quartiles are recomputed on the filtered data. The visual spread of the central 50% (the box) and the range covered by the whiskers (typically 1.5 × IQR) will be more representative of the data after removing the most extreme values.
'type' is not in `numerical_cols`, so it was not filtered directly, but its class counts still drop because whole rows are removed (see the value counts below).
df.head()
latitude	longitude	brightness	scan	track	acq_date	acq_time	satellite	instrument	confidence	version	bright_t31	frp	daynight	type	year	month	day_of_week	day_of_year	hour
0	28.0993	96.9983	303.0	1.1	1.1	2021-01-01	409	Terra	MODIS	44	6.03	292.6	8.6	D	0	2021	1	4	1	40
1	30.0420	79.6492	301.8	1.4	1.2	2021-01-01	547	Terra	MODIS	37	6.03	287.4	9.0	D	0	2021	1	4	1	54
3	30.0408	80.0501	302.0	1.5	1.2	2021-01-01	547	Terra	MODIS	46	6.03	287.7	10.7	D	0	2021	1	4	1	54
4	30.6565	78.9668	300.9	1.3	1.1	2021-01-01	547	Terra	MODIS	43	6.03	287.6	9.0	D	0	2021	1	4	1	54
6	31.4366	76.8988	300.5	1.0	1.0	2021-01-01	547	Terra	MODIS	36	6.03	287.2	5.3	D	0	2021	1	4	1	54
df.type.value_counts()
0    182841
2      6501
3        28
Name: type, dtype: int64
categorical_cols = df.select_dtypes(include='object').columns
categorical_cols
Index(['satellite', 'instrument', 'daynight'], dtype='object')
# Select categorical columns for encoding
categorical_cols_to_encode = ['daynight', 'satellite', 'instrument']

# Apply One-Hot Encoding
df_encoded = pd.get_dummies(df, columns=categorical_cols_to_encode, drop_first=True)
df_encoded.head(100)
latitude	longitude	brightness	scan	track	acq_date	acq_time	confidence	version	bright_t31	frp	type	year	month	day_of_week	day_of_year	hour	satellite_Terra
0	28.0993	96.9983	303.0	1.1	1.1	2021-01-01	409	44	6.03	292.6	8.6	0	2021	1	4	1	40	1
1	30.0420	79.6492	301.8	1.4	1.2	2021-01-01	547	37	6.03	287.4	9.0	0	2021	1	4	1	54	1
3	30.0408	80.0501	302.0	1.5	1.2	2021-01-01	547	46	6.03	287.7	10.7	0	2021	1	4	1	54	1
4	30.6565	78.9668	300.9	1.3	1.1	2021-01-01	547	43	6.03	287.6	9.0	0	2021	1	4	1	54	1
6	31.4366	76.8988	300.5	1.0	1.0	2021-01-01	547	36	6.03	287.2	5.3	0	2021	1	4	1	54	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
116	23.7766	86.3997	313.8	1.0	1.0	2021-01-02	454	51	6.03	300.9	6.8	2	2021	1	5	2	45	1
117	23.6829	86.0831	310.4	1.1	1.0	2021-01-02	454	61	6.03	297.3	6.2	2	2021	1	5	2	45	1
118	23.6661	86.9215	308.2	1.0	1.0	2021-01-02	454	50	6.03	297.4	4.8	2	2021	1	5	2	45	1
119	23.8059	86.3222	313.5	1.0	1.0	2021-01-02	454	66	6.03	300.9	8.1	0	2021	1	5	2	45	1
120	23.8448	84.9512	310.7	1.2	1.1	2021-01-02	454	68	6.03	297.7	8.5	0	2021	1	5	2	45	1
100 rows × 18 columns

df_encoded.type.value_counts()
0    182841
2      6501
3        28
Name: type, dtype: int64
# Uncomment the next line to install folium if it is not already available
# !pip install folium
import folium

# Create map and sample data
india_map = folium.Map(location=[22.351115, 78.667743], zoom_start=5)
sample_df = df_encoded.sample(n=min(10000, len(df_encoded)), random_state=42)

# Add markers
for _, row in sample_df.iterrows():
    folium.CircleMarker(
        location=[row['latitude'], row['longitude']],
        radius=3,
        color='red',
        fill=True,
        fill_opacity=0.6,
        popup=f"FRP: {row['frp']:.2f}, Date: {row['acq_date'].strftime('%Y-%m-%d')}"
    ).add_to(india_map)

display(india_map)