<a href="https://colab.research.google.com/github/crabappleabby/nerdcentralstation/blob/main/Notebook1__ForestFire.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook 1: Introduction to Machine Learning

Please make a copy of this notebook and share it with us once you have completed it. Our email addresses are laura_lxm_ma@berkeley.edu and aditi.telang@berkeley.edu.



## Introduction to Machine Learning Model

This notebook will serve as an introduction to machine learning. Please read [this](https://ischoolonline.berkeley.edu/blog/what-is-machine-learning/) article, which defines machine learning (please read up to the deep learning section).

**Question 1:**

 Please edit this cell and state two use cases for machine learning below:





## Supervised Machine Learning Model

In our project, we will be working with a supervised machine learning model. This means that the dataset we will be working with has been labeled in advance, so every example already comes with its correct answer. For example, if we are trying to build a machine learning model that can classify text as either "negative" or "positive", a pre-labeled dataset for that model may look something like this:


|   Text                                             | Sentiment |
|---------------------------------------------------|-----------|
| I loved the movie! It was fantastic.              | Positive  |
| This restaurant has terrible service.              | Negative  |
| The weather is beautiful today.                   | Positive  |
| The product arrived broken and unusable.          | Negative  |
| The new update improved the app's performance.    | Positive  |
| I'm really disappointed with their customer support.| Negative |
| What a great experience at the amusement park!    | Positive  |
| The quality of this product exceeded my expectations.| Positive |
| This book is poorly written and hard to follow.    | Negative  |
| The hiking trail was so disappointing.            | Negative  |



In this example, the "Text" column contains the input text data, and the "Sentiment" column contains the corresponding labels indicating whether the sentiment of the text is positive or negative. This data can be used to train a supervised machine learning model to predict sentiment labels for new, unseen text inputs.
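As a rough sketch, the table above could be held in a pandas DataFrame, with the input text and the labels pulled apart into two variables. The names `df`, `X`, and `y` are illustrative choices for this example, not something the project requires.

```python
import pandas as pd

# The pre-labeled examples from the table above.
df = pd.DataFrame({
    "Text": [
        "I loved the movie! It was fantastic.",
        "This restaurant has terrible service.",
        "The weather is beautiful today.",
        "The product arrived broken and unusable.",
        "The new update improved the app's performance.",
        "I'm really disappointed with their customer support.",
        "What a great experience at the amusement park!",
        "The quality of this product exceeded my expectations.",
        "This book is poorly written and hard to follow.",
        "The hiking trail was so disappointing.",
    ],
    "Sentiment": [
        "Positive", "Negative", "Positive", "Negative", "Positive",
        "Negative", "Positive", "Positive", "Negative", "Negative",
    ],
})

X = df["Text"]        # inputs (features)
y = df["Sentiment"]   # labels (target values)

print(df.shape)          # number of rows and columns
print(y.value_counts())  # how many examples carry each label
```

Checking the label counts early is a cheap way to see whether one class dominates the dataset.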

The steps involved in supervised learning include:

**1. Data Collection and Preparation:**

Gather a dataset that consists of input data (features) and their corresponding labels (target values). The dataset should be diverse and representative of the real-world scenarios you're addressing. Preprocess the data to clean, transform, and format it for use in training.

**2. Data Splitting:**

Divide the dataset into two subsets: a training set and a test/validation set. The training set will be used to train the model, while the test/validation set is used to evaluate the model's performance on new, unseen data. This helps assess how well the model generalizes.

**3. Feature Engineering or Extraction:**

Convert the raw input data into numerical features that the machine learning algorithm can work with. Depending on the nature of the data, this could involve techniques like normalization, standardization, encoding categorical variables, and creating new features. We will worry less about this step within our project! However, it is important to know that different machine learning models have different requirements and assumptions about the format of the input data. For example, many machine learning algorithms cannot directly handle categorical features (e.g., "red," "green," "blue"). These features need to be encoded as numerical values through techniques like one-hot encoding or label encoding, so the model can process them effectively.
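To make the two encodings concrete, here is a small sketch using pandas on a made-up `color` column. The column name and values are assumptions chosen for illustration; other libraries offer their own encoders.

```python
import pandas as pd

# A made-up categorical feature.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(df, columns=["color"], dtype=int)

# Label encoding: each category becomes a single integer code.
# With string categories the codes follow alphabetical order:
# blue=0, green=1, red=2.
label_codes = df["color"].astype("category").cat.codes

print(one_hot)
print(label_codes.tolist())
```

One-hot encoding avoids implying an order between categories, while label encoding keeps the data in a single column.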

**4. Model Selection:**

Choose an appropriate machine learning algorithm or model architecture that suits your problem. The choice of algorithm depends on the type of task (classification, regression, etc.) and the characteristics of the data. We will explore this later on!

**5. Model Training:**

Feed the training data (feature representations of input data and their corresponding labels) into the chosen model. The model learns the underlying patterns and relationships between the features and labels during this phase. Training involves adjusting model parameters to minimize the difference between predicted and actual labels.
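One way to picture "adjusting model parameters to minimize the difference" is gradient descent on a tiny straight-line model. This is only a sketch on made-up data that follows y = 2x + 1; the model, the data, and the learning rate are all assumptions for illustration, and real models are usually trained through a library.

```python
# Toy training data generated from the rule y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = 0.0, 0.0        # model parameters, learned during training
learning_rate = 0.05   # step size, chosen before training

def mse(w, b):
    """Mean squared difference between predictions w*x + b and labels."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

initial_error = mse(w, b)

for step in range(2000):
    # Gradients of the mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    # Move each parameter a small step in the direction that lowers the error.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_error = mse(w, b)
print(round(w, 2), round(b, 2))    # close to 2 and 1
print(initial_error, final_error)  # the error shrinks during training
```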

**6. Model Evaluation:**

After training, assess the model's performance on the test/validation set that it hasn't seen before. Use evaluation metrics to quantify how well the model is performing. Common evaluation metrics include accuracy, precision, recall, F1-score, and Mean Squared Error (MSE) (we will explore evaluation metrics in more detail later on!).
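The metrics named above can be computed by hand on a small example. The labels and predictions below are made up for illustration (1 = positive, 0 = negative); in practice a library would normally compute these for you.

```python
# Hypothetical true labels and model predictions for a classifier.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

accuracy = (tp + tn) / len(pairs)   # share of all predictions that are right
precision = tp / (tp + fp)          # of predicted positives, how many are real
recall = tp / (tp + fn)             # of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

# Mean Squared Error applies to regression (numeric predictions).
actual = [3.0, 5.0, 2.0]
predicted = [2.5, 5.0, 4.0]
mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

print(accuracy, precision, recall, round(f1, 3), round(mse, 3))
```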

**7. Hyperparameter Tuning:**

Fine-tune the model's hyperparameters to optimize its performance. Hyperparameters are settings that are not learned during training; they are set before training and affect how the model learns from the data. Examples include the learning rate, regularization strength, and number of layers. They control various aspects of the learning algorithm and model architecture. Techniques like grid search or random search can help identify good hyperparameter values. Each machine learning model typically has its own set of hyperparameters, so when you start working on your machine learning model, you will learn more about that model's hyperparameters!
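Grid search simply tries every combination of candidate hyperparameter values and keeps the one that scores best on held-out data. The sketch below does this for a toy straight-line model trained by gradient descent; the data, the candidate values, and the helper names `train` and `mse` are assumptions made for this example.

```python
from itertools import product

# Toy data following y = 2x + 1, split into training and validation parts.
train_x, train_y = [0.0, 1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 5.0, 7.0, 9.0]
val_x, val_y = [5.0, 6.0], [11.0, 13.0]

def train(xs, ys, learning_rate, n_steps):
    """Fit y = w*x + b with plain gradient descent and return (w, b)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

def mse(w, b, xs, ys):
    """Mean squared error of the line w*x + b on the given data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# The grid: every combination of these values gets tried.
learning_rates = [0.001, 0.01, 0.05]
step_counts = [10, 100, 1000]

best_params, best_error = None, float("inf")
for lr, steps in product(learning_rates, step_counts):
    w, b = train(train_x, train_y, lr, steps)
    error = mse(w, b, val_x, val_y)   # score on data not used for training
    if error < best_error:
        best_params, best_error = (lr, steps), error

print(best_params, best_error)
```

Random search follows the same loop but draws the combinations at random instead of enumerating all of them.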

**8. Model Validation:**

Once the hyperparameters are tuned, validate the model's performance on a separate validation set that played no part in the tuning. This step is crucial to ensure that the hyperparameter tuning process didn't lead to overfitting on the test set.

**9. Monitoring and Maintenance:**

Continuously monitor the model's performance in the real world. As new data becomes available, periodically retrain the model using updated data to ensure it remains accurate and relevant.

**Question 2:**

 Please edit this cell to include two new things you learned about training a supervised machine learning model:

### Training/Testing the Supervised Model

Training and testing the model is extremely important. Please read [this](https://developers.google.com/machine-learning/crash-course/training-and-test-sets/splitting-data) article on training and testing. It is common practice to have 80% of your data go to training the model, 10% go to testing your model, and the remaining 10% go to validating your model. But, you don't have to follow that split!
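As a minimal sketch of the 80/10/10 split, the example below shuffles 100 stand-in rows and slices them into three parts. The row count and the seed are arbitrary assumptions; the same idea applies to the rows of a real dataset.

```python
import random

data = list(range(100))    # stand-in for 100 rows of a dataset

rng = random.Random(42)    # fixed seed so the shuffle is repeatable
shuffled = data[:]         # copy, so the original order is kept
rng.shuffle(shuffled)

# Integer arithmetic keeps the split sizes exact.
n_train = len(shuffled) * 80 // 100
n_test = len(shuffled) * 10 // 100

train = shuffled[:n_train]
test = shuffled[n_train:n_train + n_test]
validation = shuffled[n_train + n_test:]   # whatever remains

print(len(train), len(test), len(validation))  # 80 10 10
```

Shuffling before slicing matters: if the rows are sorted in some way, slicing without shuffling would put different kinds of examples in each part.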


**Question 3:**

Please edit this cell to include why we don't use all of our data to train the model:





One of the main reasons we don't use all of our data to train the model is overfitting, which refers to creating a model that matches the training data so closely that it fails to make correct predictions on new data. If every example went into training, we would have no unseen data left to check for this.


We essentially want the training set to act as a canvas for our models to learn from. We therefore use random sampling to gather a diverse set of data samples, each representing a different facet of the problem at hand. This random selection helps ensure that no single aspect dominates, allowing the model to grasp the broader landscape and make good predictions for unseen data. By keeping each kind of sample proportionally represented, we prevent biases from distorting the model's understanding. This mosaic of diverse and balanced data equips our models to generalize effectively, applying their knowledge to new scenarios accurately.

Overfitting is a big reason for including a test set. A test set is crucial for evaluating how well a trained machine learning model performs on new, unseen data. It helps us assess if the model's learned patterns generalize effectively, prevents overfitting, enables unbiased evaluation, aids in model selection and tuning, provides an estimate of real-world performance, and builds trust in the model's capabilities beyond the training data.
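An extreme, deliberately silly case of overfitting is a "model" that only memorizes its training examples. The sketch below reuses sentences from the sentiment table earlier in this notebook; the function names and the fallback guess are assumptions made for illustration.

```python
# Labeled examples used for "training" (the model just memorizes them).
train_set = {
    "I loved the movie! It was fantastic.": "Positive",
    "This restaurant has terrible service.": "Negative",
    "The weather is beautiful today.": "Positive",
    "The product arrived broken and unusable.": "Negative",
}

# Examples the model has never seen.
test_set = {
    "This book is poorly written and hard to follow.": "Negative",
    "The hiking trail was so disappointing.": "Negative",
    "What a great experience at the amusement park!": "Positive",
}

def memorizing_model(text):
    """Return the memorized label, or a fixed guess for anything unseen."""
    return train_set.get(text, "Positive")

def accuracy(examples):
    """Share of examples whose predicted label matches the true label."""
    correct = sum(memorizing_model(text) == label
                  for text, label in examples.items())
    return correct / len(examples)

train_accuracy = accuracy(train_set)
test_accuracy = accuracy(test_set)
print(train_accuracy)  # perfect on data it has already seen
print(test_accuracy)   # much worse on new data
```

Judged on the training data alone, this model looks flawless. Only the test set reveals that it learned nothing that carries over to new text.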

#### Exercise 1

For the last part of this notebook, you will choose a dataset from [Kaggle](https://www.kaggle.com/), upload it into this notebook, and split it into a training set and a test set (we will not include a validation set for this exercise). The steps to complete this exercise are:
1. Create an account on [Kaggle](https://www.kaggle.com/) (a platform that data scientists often use to get datasets for exploration and the platform we will use for our project).
2. Choose a [dataset](https://www.kaggle.com/datasets) and download it (make sure it isn't too big!).
3. Upload the dataset to this Google Colab notebook.
4. Split the dataset into a train/test set (you can choose to have 80% go to train and 20% go to test).

There are instructions on how to upload a dataset to this notebook below. There is also a coding cell at the bottom of the notebook that you can use to write your code.



##### Uploading a dataset to this notebook

**1. Upload the Dataset to Google Drive:**

Before uploading the dataset to Colab, it's recommended to first upload the dataset to your Google Drive. This allows for seamless integration between Google Drive and Google Colab.

**2. Mount Google Drive in Colab:**

In your Colab notebook, you need to mount your Google Drive to access the files you uploaded. Run the following code cell to mount Google Drive:

In [None]:
# Run this cell!
from google.colab import drive
drive.mount('/content/drive')

**3. Navigate to the Dataset:**

After mounting Google Drive, you can navigate to the directory where you uploaded the dataset. Use the file path based on the path in your Google Drive. Essentially, know where you put your dataset in your Google Drive.

**4. Load the Dataset:**

Once you've navigated to the directory, you can load the dataset using Python code. For example, if you have a CSV file named "data.csv" in a folder named "datasets," its location is /content/drive/My Drive/datasets/data.csv and you can load it like this:


import pandas as pd

dataset_path = '/content/drive/My Drive/datasets/data.csv'

df = pd.read_csv(dataset_path)

**5. Start using the dataset stored in df!**
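A first look at a freshly loaded DataFrame might go something like the sketch below. So that it runs anywhere, it reads a tiny made-up CSV from memory instead of a file on Google Drive; with your own dataset, `df` would come from the `pd.read_csv(dataset_path)` call above.

```python
import io
import pandas as pd

# Stand-in for a real file: a tiny CSV held in memory (made-up values).
csv_text = (
    "Text,Sentiment\n"
    "Great film,Positive\n"
    "Awful service,Negative\n"
    "Lovely day,Positive\n"
)
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # column names
print(df.head())            # the first few rows
print(df.isna().sum())      # missing values per column
```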

**Another helpful tip:**

To split your data randomly, look into the pandas DataFrame function sample!
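Here is a small sketch of how `sample` behaves, using a made-up ten-row DataFrame. The column name `row_id` and the `random_state` value are arbitrary assumptions; you will still need to adapt the idea to your own dataset.

```python
import pandas as pd

df = pd.DataFrame({"row_id": range(10)})   # made-up data

# frac is the share of rows to draw at random;
# random_state fixes the seed so the draw is repeatable.
picked = df.sample(frac=0.8, random_state=0)

# Rows that were not picked: drop the picked rows by their index.
left_over = df.drop(picked.index)

print(len(picked), len(left_over))  # 8 2
```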



In [None]:
# Write your code here!
import pandas as pd



You have finished this notebook! Well done!