## Train a Naive Bayes Classifier Model
The model will be trained using pandas and scikit-learn.
The training data comes from https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction

I am running this through VS Code, using a Docker container. You may also use the `ipynb` file in other Jupyter Notebook style setups. Consult the Jupyter Notebook documentation for options.

**To download the data, run this cell.**

Running this cell downloads the data if you are running it in the Docker container. If not, you will need to navigate to the `KAGGLE_DATA_URL` and download the data manually.

In [2]:
import os
from data import download_kaggle_dataset
KAGGLE_DATA_URL = "https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction"
DATA_PATH = os.path.join(os.getcwd(), "data", "naive_bayes")
download_kaggle_dataset(KAGGLE_DATA_URL, DATA_PATH)

/workspaces/MS365/src/data/naive_bayes contains data. Delete the file(s) if you want to download again.
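
If you downloaded the data manually, a quick check like the one below can confirm that the files are where the later cells expect them. This is only a minimal sketch: `missing_data_files` is a hypothetical helper (it is not part of the notebook's `data` module), and it assumes the two files are named `train.csv` and `test.csv`.

```python
import os
from pathlib import Path


def missing_data_files(data_path, expected=("train.csv", "test.csv")):
    """Return the expected file names that are not present in data_path."""
    folder = Path(data_path)
    return [name for name in expected if not (folder / name).is_file()]


# Same location the download cell uses.
DATA_PATH = os.path.join(os.getcwd(), "data", "naive_bayes")

missing = missing_data_files(DATA_PATH)
if missing:
    print(f"Missing from {DATA_PATH}: {missing}")
else:
    print(f"All expected files found in {DATA_PATH}")
```

An empty list means both files are in place and the import cells further down should be able to read them.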


**Import the necessary python packages**

Import `pandas`, `sklearn.preprocessing.OneHotEncoder`, `sklearn.preprocessing.LabelEncoder`, `sklearn.preprocessing.OrdinalEncoder`, `sklearn.model_selection.train_test_split`, `sklearn.naive_bayes.MultinomialNB`, `sklearn.metrics.accuracy_score`, `sklearn.metrics.classification_report`, `sklearn.metrics.confusion_matrix`, `sklearn.metrics.ConfusionMatrixDisplay`, `imblearn.over_sampling.RandomOverSampler`, `imblearn.under_sampling.RandomUnderSampler`, and `matplotlib.pyplot`. Typically, packages such as `pandas` and `matplotlib.pyplot` are imported with an alias (for example, `pd` and `plt`). I will not be following that strategy here.

By default, `pandas` truncates the display of dataframes that have many rows or many columns. You can change this behavior with the `set_option()` function. I have set it to show all columns. This can make cells that display the data slow to run when there are many columns. That is expected behavior for this analysis.

If you are running the Docker container or if you are using [Google Colab](https://colab.research.google.com/), the `pip install` has already been done. If not, please consult the docs for your Jupyter Notebook environment on how to install the needed packages.
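
If you are unsure whether your environment already has everything, a check along these lines can tell you before the import cell fails. It is a rough sketch, not part of the original notebook: `find_missing` is a hypothetical helper, and the mapping from import names to `pip` package names is written out by hand.

```python
import importlib.util

# Import name -> name to pass to `pip install`.
REQUIRED = {
    "pandas": "pandas",
    "sklearn": "scikit-learn",
    "imblearn": "imbalanced-learn",
    "matplotlib": "matplotlib",
}


def find_missing(import_names):
    """Return the top-level import names that cannot be found."""
    return [name for name in import_names
            if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED)
if missing:
    to_install = " ".join(REQUIRED[name] for name in missing)
    print(f"Missing packages. Try: pip install {to_install}")
else:
    print("All required packages are available.")
```

`find_spec` only looks the package up without importing it, so the check is quick even for large libraries.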

In [3]:
import pandas
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
import matplotlib.pyplot
pandas.set_option("display.max_columns", None)
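
The `set_option` call above applies to every cell that follows. If showing all columns everywhere turns out to be too slow, one alternative is `pandas.option_context`, which changes an option only inside a `with` block. The sketch below uses a made-up 30-column dataframe (`demo`) purely for illustration.

```python
import pandas

# Made-up single-row frame with 30 columns, only for demonstration.
demo = pandas.DataFrame([list(range(30))],
                        columns=[f"col_{i}" for i in range(30)])

before = pandas.get_option("display.max_columns")

# Inside the block, the column limit is lifted (None means no limit).
with pandas.option_context("display.max_columns", None):
    inside = pandas.get_option("display.max_columns")
    print(demo)

# After the block, the previous setting is restored automatically.
after = pandas.get_option("display.max_columns")
print(before, inside, after)
```

I am keeping the global `set_option` call for this analysis, since every display cell here benefits from seeing all columns.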

**Import the data for analysis**

Two files are downloaded from the Kaggle site: the dataset's uploader provided one file with training data and one with testing data. This example will go through the process of splitting the data into a training and testing set. Thus, the two files need to be joined together before they are separated again for the Naive Bayes model.

Use `pandas.read_csv` to import both files. The files are named `train.csv` and `test.csv`. The data from these two files will be imported to the variables `train_file` and `test_file`. The two dataframes can be joined to make a single dataframe using the `pandas.concat` function. The first parameter of `concat` is a list of the dataframes to be joined. There are other parameters that can be set, such as `ignore_index=True`, which gives the combined dataframe a fresh index starting at 0. Please consult the function's [documentation](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) for more information.

The concatenated data will be saved to the `df` variable. To view the data, use the `head()` method. The default value is `n=5`. I have passed `n=15` to see the first 15 rows of data.
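
To see what `ignore_index=True` changes, here is a small sketch with two made-up dataframes standing in for the real files. The names `train_part` and `test_part` and their values are illustrative only and are not taken from the Kaggle data.

```python
import pandas

# Tiny stand-ins for train.csv and test.csv (values are made up).
train_part = pandas.DataFrame({
    "id": [101, 102, 103],
    "satisfaction": ["satisfied", "neutral or dissatisfied", "satisfied"],
})
test_part = pandas.DataFrame({
    "id": [201, 202],
    "satisfaction": ["neutral or dissatisfied", "satisfied"],
})

# Without ignore_index, each frame keeps its own row labels.
kept_index = pandas.concat([train_part, test_part])
print(kept_index.index.tolist())  # [0, 1, 2, 0, 1]

# With ignore_index=True, the result is relabelled 0..n-1.
new_index = pandas.concat([train_part, test_part], ignore_index=True)
print(new_index.index.tolist())   # [0, 1, 2, 3, 4]

# head(n=...) returns the first n rows; the default is n=5.
print(new_index.head(n=2))
```

A fresh index avoids duplicate row labels, which matters later if rows are selected by label after the data is split again.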


In [8]:
train_file = pandas.read_csv(os.path.join(DATA_PATH, "train.csv"))
test_file = pandas.read_csv(os.path.join(DATA_PATH, "test.csv"))

df = pandas.concat([train_file, test_file], ignore_index=True)
df.head(n=15)

Unnamed: 0.1,Unnamed: 0,id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,0,70172,Male,Loyal Customer,13,Personal Travel,Eco Plus,460,3,4,3,1,5,3,5,5,4,3,4,4,5,5,25,18.0,neutral or dissatisfied
1,1,5047,Male,disloyal Customer,25,Business travel,Business,235,3,2,3,3,1,3,1,1,1,5,3,1,4,1,1,6.0,neutral or dissatisfied
2,2,110028,Female,Loyal Customer,26,Business travel,Business,1142,2,2,2,2,5,5,5,5,4,3,4,4,4,5,0,0.0,satisfied
3,3,24026,Female,Loyal Customer,25,Business travel,Business,562,2,5,5,5,2,2,2,2,2,5,3,1,4,2,11,9.0,neutral or dissatisfied
4,4,119299,Male,Loyal Customer,61,Business travel,Business,214,3,3,3,3,4,5,5,3,3,4,4,3,3,3,0,0.0,satisfied
5,5,111157,Female,Loyal Customer,26,Personal Travel,Eco,1180,3,4,2,1,1,2,1,1,3,4,4,4,4,1,0,0.0,neutral or dissatisfied
6,6,82113,Male,Loyal Customer,47,Personal Travel,Eco,1276,2,4,2,3,2,2,2,2,3,3,4,3,5,2,9,23.0,neutral or dissatisfied
7,7,96462,Female,Loyal Customer,52,Business travel,Business,2035,4,3,4,4,5,5,5,5,5,5,5,4,5,4,4,0.0,satisfied
8,8,79485,Female,Loyal Customer,41,Business travel,Business,853,1,2,2,2,4,3,3,1,1,2,1,4,1,2,0,0.0,neutral or dissatisfied
9,9,65725,Male,disloyal Customer,20,Business travel,Eco,1061,3,3,3,4,2,3,3,2,2,3,4,4,3,2,0,0.0,neutral or dissatisfied
