<a href="https://colab.research.google.com/github/earthianhivemind/DLlearning/blob/main/Week_2_building_your_first_text_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unit 2: Building Your First Text Classifier (A Guided Experience!)
Welcome to your hands-on activity for Natural Language Processing (NLP)!
In this Colab notebook, you'll act as an AI trainer, teaching a simple AI model to classify text.
You don't need to write any code. Just read the instructions, make a few text changes, and press the 'Play' button (▶) on each code cell to run it!


## Section 1: Introduction and Setup

Welcome to your first hands-on AI development experience!

Imagine you're trying to automatically sort incoming messages (like emails or customer inquiries) into different categories. This is a common task for **Natural Language Processing (NLP)**, specifically **Natural Language Understanding (NLU)**.

In this notebook, you will:
1.  **Define your own categories** for text.
2.  **Provide example sentences** (your 'training data') for each category.
3.  **'Train' your AI model** to learn from your examples.
4.  **'Test' your AI** by seeing how it classifies new sentences.

**How to use this notebook:**
* Read all the instructions in the grey text boxes (these are 'Markdown' cells).
* To run a code cell (the boxes with `In []:` next to them), click on the cell and then click the 'Play' button (▶) that appears on its left.
* The output of the code will appear directly below the cell.
* You will only need to change the text where we tell you to.

### Installing Libraries
In this notebook, alongside our own code, we will use packages of code written by other people so that we don't need to reinvent the wheel. The package we use is:
- Scikit-Learn (https://scikit-learn.org/stable/)

Read about this library to understand what it is used for. You don't need to be able to write code with it, but by the end of the lesson you should be able to write a few sentences about why it was useful.

In [1]:
# @title Install necessary libraries
# This step might take a moment the first time you run it.
# Text after a hash (#) is a comment, which means it is not executed as code.
# Lines starting with ! are a notebook feature, not Python: they run commands outside of Python, such as installing packages with pip.
!pip install scikit-learn

# Import the tools we'll need for our text classifier. We will use them later in the notebook.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from IPython.display import Markdown, display



## Section 2: Define Your Categories and Provide Training Examples

Here's where you define the 'brain' of your text classifier!

Think about a real-world scenario (e.g., customer service, HR, personal email inbox).
You need to:
1.  Choose **2-3 distinct categories** that your AI will learn to identify.
2.  For each category, provide **3-5 short, clear example phrases or sentences**. These are your 'training data' – the examples you'll use to 'teach' your AI.

**Instructions:**
* Look at the `training_data` variable below.
* Inside the square brackets `[]`, you will see examples for 'PRODUCT_INQUIRY' and 'TECHNICAL_SUPPORT'.
* **Modify these examples** to fit your chosen categories and phrases. You can also **add a third category** if you like!
* Keep the structure: `('YOUR CATEGORY NAME', 'Your example sentence.')`
* **IMPORTANT:** Make sure your example sentences for each category are distinct enough for the AI to learn from.

**Extra Information:**
* Below, we make use of variables to store data and allow us to refer to it easily later.
* `training_data` is a variable containing a List of the data that we will use to train our model. A list is denoted by square brackets `[]`.
* Each entry in `training_data` is a Tuple: a group of values written inside round brackets `()`.
* Each tuple holds two Strings: the category name and the example sentence. A String is a data type containing regular text, denoted by single (`'`) or double (`"`) quotation marks.
* `texts` and `labels` are variables that contain just the text and just the labels of our training data respectively. We use a `for` loop written inside square brackets (known as a list comprehension) to pull this data out of the `training_data` list.
* We also make use of the `print()` function to show the contents of these variables. Similarly to the code we imported at the beginning, functions contain pre-written code that we can supply data to and use.
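
The ideas in the list above can be tried out on a much smaller example. The sketch below uses made-up data (`toy_data` and its two categories are hypothetical names chosen for illustration, not part of this notebook's code) to show how a list of tuples is unpacked, first with an ordinary `for` loop and then with the shorter list comprehension.

```python
# A small, made-up list of (category, sentence) tuples with the same shape as training_data
toy_data = [
    ('GREETING', 'Hello there!'),
    ('FAREWELL', 'See you tomorrow.'),
]

# Each entry is a tuple, so we can unpack it into two variables
first_category, first_text = toy_data[0]

# The same unpacking inside a loop, written out the long way
toy_texts = []
toy_labels = []
for category, text in toy_data:
    toy_texts.append(text)
    toy_labels.append(category)

# The short form (a list comprehension) builds the same list in one line
short_texts = [text for category, text in toy_data]

print(first_category)            # GREETING
print(toy_texts)                 # ['Hello there!', 'See you tomorrow.']
print(toy_texts == short_texts)  # True
```

Both forms give the same result; the short form is simply more compact.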

In [2]:
# @title Run this cell once you have made your changes

training_data = [
    # Category 1: Product Inquiry
    ('PRODUCT_INQUIRY', 'Tell me more about your new smartphone model.'),
    ('PRODUCT_INQUIRY', 'Where can I find details on the specifications of the XYZ printer?'),
    ('PRODUCT_INQUIRY', 'Do you have different color options for this laptop?'),
    ('PRODUCT_INQUIRY', 'What are the features of the latest software update?'),
    ('PRODUCT_INQUIRY', 'Is this item available in stock right now?'),

    # Category 2: Technical Support
    ('TECHNICAL_SUPPORT', 'My internet connection keeps dropping, what should I do?'),
    ('TECHNICAL_SUPPORT', 'How do I troubleshoot error code 404 on my device?'),
    ('TECHNICAL_SUPPORT', 'My computer screen suddenly went black.'),
    ('TECHNICAL_SUPPORT', 'I cannot connect to the Wi-Fi network.'),
    ('TECHNICAL_SUPPORT', 'The software application crashed unexpectedly.'),

    # Uncomment and modify the lines below to add a third category if you wish!
    # ('BILLING_QUESTION', 'I have a question about my last invoice.'),
    # ('BILLING_QUESTION', 'When is my monthly payment due?'),
    # ('BILLING_QUESTION', 'My recent payment did not go through.'),
    # ('BILLING_QUESTION', 'Can you explain my current charges?'),
    # ('BILLING_QUESTION', 'I need help understanding my bill.'),
]


texts = [text for category, text in training_data]
labels = [category for category, text in training_data]

print("Training data loaded successfully!")
print(f"Number of examples: {len(texts)}") # An "f" before the opening quotation mark lets you place variables and function calls inside {} within the string
print(f"Categories found: {set(labels)}")

Training data loaded successfully!
Number of examples: 10
Categories found: {'TECHNICAL_SUPPORT', 'PRODUCT_INQUIRY'}


## Section 3: Preparing the Data (Behind the Scenes)

Computers don't understand words like humans do. They understand numbers!

In this step, our AI tool will convert your sentences into a numerical format it can work with.
It does this by counting words and creating a unique numerical representation for each sentence.
Think of it like creating a giant spreadsheet where each row is a sentence and each column is a word from your training data, with numbers indicating how many times each word appears.
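
Before it can count anything, the tool has to decide what counts as a "word". As a minimal sketch, assuming scikit-learn's default settings for `CountVectorizer`, the snippet below shows how one sentence is split up. The name `split_into_words` is just an illustrative variable name.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function the vectorizer uses to split text into words
split_into_words = CountVectorizer().build_analyzer()

tokens = split_into_words('I cannot connect to the Wi-Fi network.')
print(tokens)
# ['cannot', 'connect', 'to', 'the', 'wi', 'fi', 'network']
```

With the default settings, everything is converted to lowercase, punctuation is dropped (so "Wi-Fi" becomes the two words "wi" and "fi"), and single-character words such as "I" are ignored.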

In [3]:
# This is the tool that converts text into numbers (counts words)
vectorizer = CountVectorizer()

# This command 'fits' the vectorizer to your training data and transforms it into numbers
X_train = vectorizer.fit_transform(texts)

print("Text data prepared for AI training!")
print(f"Example of numerical representation shape: {X_train.shape} (This means it found {X_train.shape[1]} unique words across your examples)")

Text data prepared for AI training!
Example of numerical representation shape: (10, 66) (This means it found 66 unique words across your examples)


In [4]:
print(vectorizer.get_feature_names_out())
print()
print(X_train.toarray())

['404' 'about' 'application' 'are' 'available' 'black' 'can' 'cannot'
 'code' 'color' 'computer' 'connect' 'connection' 'crashed' 'details'
 'device' 'different' 'do' 'dropping' 'error' 'features' 'fi' 'find' 'for'
 'have' 'how' 'in' 'internet' 'is' 'item' 'keeps' 'laptop' 'latest' 'me'
 'model' 'more' 'my' 'network' 'new' 'now' 'of' 'on' 'options' 'printer'
 'right' 'screen' 'should' 'smartphone' 'software' 'specifications'
 'stock' 'suddenly' 'tell' 'the' 'this' 'to' 'troubleshoot' 'unexpectedly'
 'update' 'went' 'what' 'where' 'wi' 'xyz' 'you' 'your']

[[0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
  0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 1 1 0 1 0 0 0 0 0 1 0 0 0 2 0 0 0 0 0 0 0 1 0 1 0 0]
 [0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0
  0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0

`CountVectorizer()` is code we got from the scikit-learn library earlier. It turns the text from `training_data` into a "bag-of-words". It finds all the different words across the entries in `training_data`, then for each entry it gives us a list such as the following:

`[0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1]`

Where the first position is "404", the second position is "about", the third is "application", etc...

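Each number is the count of how many times that word appears in the entry. A 0 means the word isn't present, a 1 means it appears once, and a 2 means it appears twice (for example, "the" appears twice in the second training sentence).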

## Section 4: Training Your AI (The Learning Phase)

Now for the exciting part: **training your AI!**

Here, the AI model (a Naive Bayes classifier called `MultinomialNB`) will learn which words tend to appear in each category from your prepared numerical data.

The more high-quality, varied examples you provide in Section 2, the better your AI will learn!

In [5]:
# This is our AI model that will learn to classify text
classifier = MultinomialNB()

# This command 'trains' the classifier using your prepared data (X_train) and your labels
classifier.fit(X_train, labels)

print("Your AI model has been trained successfully!")

Your AI model has been trained successfully!


In [9]:
# @title Section 5: Test Your AI (Making Predictions)
# @markdown Great! Your AI is now trained. Let's see how smart it is!

# @markdown **Instructions:**
# @markdown * Type a new sentence (that you *didn't* use in training) into the box below.
# @markdown * Click the 'Play' button (▶) on the cell to see how your AI classifies it.
# @markdown * Try different sentences and see if it gets them right.

# @markdown ### ✏️ Your Turn: Test Your AI Here! Type a new sentence in the box below.
test_sentence = "I need to know how to change my phone" # @param {type:"string"}

# Prepare the test sentence in the same numerical format as the training data
X_test = vectorizer.transform([test_sentence])

# Make a prediction using your trained AI
predicted_category = classifier.predict(X_test)[0]

print(f"The AI predicts this sentence belongs to the category: {predicted_category}")

The AI predicts this sentence belongs to the category: TECHNICAL_SUPPORT
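
A single predicted category does not tell you how sure the model is. As a hedged sketch on made-up data (every name starting with `demo_`, the helper `show_confidence`, and both categories are hypothetical), the snippet below uses `predict_proba` to show the model's estimated probability for each category, and what happens when a sentence contains no words from the training data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training examples: two per category
demo_texts = ['my screen is broken', 'my laptop will not start',
              'what colors are available', 'what is the price']
demo_labels = ['SUPPORT', 'SUPPORT', 'INQUIRY', 'INQUIRY']

demo_vectorizer = CountVectorizer()
demo_classifier = MultinomialNB()
demo_classifier.fit(demo_vectorizer.fit_transform(demo_texts), demo_labels)

def show_confidence(sentence):
    """Return the estimated probability of each category for one sentence."""
    features = demo_vectorizer.transform([sentence])
    probabilities = demo_classifier.predict_proba(features)[0]
    return {str(c): round(float(p), 2)
            for c, p in zip(demo_classifier.classes_, probabilities)}

known = show_confidence('my laptop screen is broken')
unknown = show_confidence('zebra giraffe elephant')
print(known)    # SUPPORT gets a much higher probability than INQUIRY
print(unknown)  # {'INQUIRY': 0.5, 'SUPPORT': 0.5}
```

Words the model never saw during training are ignored by the vectorizer. When none of the words are known, the model can only fall back on how common each category was in the training examples, which here is an even split.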


## Section 6: Reflect and Next Steps

You've just built and tested your first text classifier!

Now, head back to the Unit 2 discussion forum and share your experience:
* What did the libraries we used provide?
* What categories did you choose for your AI classifier, and why?
* Provide one example of a training phrase you used for each category.
* Provide one new test phrase you typed into the Colab notebook, and the category your AI assigned to it. Was the result what you expected?
* Based on this hands-on experience, how do you think the **quality, quantity, and diversity of your "training data"** (the examples you provided) would affect the AI's ability to accurately understand and categorize *new, real-world* text?
* How could an AI tool like this, capable of understanding and classifying text, solve a real problem in your professional setting, making processes more **human-centered** (e.g., by freeing up human staff for complex issues, or speeding up customer service by directing inquiries correctly)?

Congratulations on completing your first practical AI development activity!