# Text Classification for Active vs. Passive Voice Detection

### Objective: To develop a text classification model that can effectively detect whether a given sentence is in the active or passive voice, using a dataset of labelled sentences. This challenge aims to assess your skills in natural language processing, machine learning, and model explainability.

## 1. Importing Initial Libraries

In [1]:
import pandas as pd

In [2]:
df = pd.read_excel("immverse_ai_eval_dataset.xlsx")

In [3]:
df

Unnamed: 0,id,sentence,voice
0,1,The chef prepares the meal.,Active
1,2,The teacher explains the lesson clearly.,Active
2,3,The gardener waters the plants every morning.,Active
3,4,The kids play soccer in the park.,Active
4,5,The author wrote a thrilling novel.,Active
5,6,The scientist conducts experiments in the lab.,Active
6,7,The company launched a new product.,Active
7,8,The artist paints a beautiful portrait.,Active
8,9,The musician composes a melody.,Active
9,10,The photographer takes stunning pictures.,Active


# 2. Preprocessing the Dataset

In [4]:
import spacy

In [5]:
nlp = spacy.load("en_core_web_sm")

#### spacy.load("en_core_web_sm") loads a pre-trained English language model in spaCy, providing access to a range of NLP functionality for text processing and analysis.

#### The loaded pipeline (nlp) handles the text preprocessing steps:

 * ##### Tokenization: splitting text into words, punctuation marks, etc.
 * ##### Part-of-speech tagging: assigning word types to tokens, such as verb, noun, or adjective.
 * ##### Lemmatization: reducing words to their base or root form.
   

In [6]:
def preprocess_text(text):
    doc = nlp(text)
    tokens = [token.lemma_ for token in doc if not token.is_punct and not token.is_space]
    return " ".join(tokens)

In [7]:
df['clean_sentence'] = df['sentence'].apply(preprocess_text)

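* #### The preprocess_text function takes a raw sentence (text) as input, processes it with spaCy (en_core_web_sm model), drops punctuation and whitespace tokens, lemmatizes the remaining tokens (token.lemma_), and returns them joined by single spaces. It is applied to every row of the 'sentence' column to build 'clean_sentence'.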

In [8]:
X = df['clean_sentence']
y = df['voice'].map({'Active': 0, 'Passive': 1})

* ##### The first line selects the 'clean_sentence' column from the DataFrame df and assigns it to X. X holds the features that will be used to train the machine learning model.
* ##### The second line processes the 'voice' column of df. y holds the target variable, that is, the labels the model will try to predict.
* ##### The .map() function converts the two categories into numerical form: 'Active' is mapped to 0 and 'Passive' is mapped to 1.
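One detail of .map() worth checking: a value that is missing from the dictionary becomes NaN instead of raising an error. Below is a minimal sketch of a guard against that, using a made-up three-row Series. The lowercase "active" row is a hypothetical typo for illustration, not something observed in the dataset.

```python
import pandas as pd

# Hypothetical labels; the lowercase "active" simulates a typo in the data.
voice = pd.Series(["Active", "Passive", "active"])

label_map = {"Active": 0, "Passive": 1}
y_demo = voice.map(label_map)  # values not in label_map become NaN

# Count the labels that the mapping did not cover.
n_unmapped = int(y_demo.isna().sum())
print(n_unmapped)
```

Running the same isna() check on the real y before splitting would catch any label that is not spelled exactly 'Active' or 'Passive'.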

# 3. Train-Test Split

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

#### First Split (train_test_split with test_size=0.4):
##### The initial split divides the features (X) and labels (y) into two subsets:
* ##### X_train and y_train contain 60% of the data (for training).
* ##### X_temp and y_temp contain the remaining 40% of the data (combined validation and test sets).

#### Second Split (train_test_split on X_temp and y_temp with test_size=0.5):
##### The second split further divides the held-out 40% into two equal, non-overlapping subsets:
* ##### X_val and y_val contain 20% of the original data (for validation).
* ##### X_test and y_test also contain 20% of the original data (for testing).
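The arithmetic behind the two splits can be sketched in plain Python. This assumes the dataset has 40 rows (the total of the three set sizes this notebook prints) and that scikit-learn rounds a fractional test_size up to a whole number of rows; the variable names are illustrative only.

```python
import math

n_samples = 40  # assumed total: 24 + 8 + 8 rows

# First split: test_size=0.4 holds out 40% of all rows.
n_temp = math.ceil(0.4 * n_samples)
n_train = n_samples - n_temp

# Second split: test_size=0.5 halves the held-out rows.
n_test = math.ceil(0.5 * n_temp)
n_val = n_temp - n_test

print(n_train, n_val, n_test)
```

Because the second test_size is a fraction of the held-out 40%, not of the full dataset, 0.5 yields 20% of the original rows for each of the validation and test sets.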

In [11]:
print(f"Training set size: {len(X_train)}")
print(f"Validation set size: {len(X_val)}")
print(f"Test set size: {len(X_test)}")

Training set size: 24
Validation set size: 8
Test set size: 8


* #### The printed sizes (24, 8, and 8 sentences) confirm that X_train, X_val, and X_test match the proportions requested during splitting.
* #### The dataset is split into Training (60%), Validation (20%), and Test (20%) sets.
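With sets this small, a random split can leave the classes unevenly represented in each subset. One possible variation, not used in this notebook, is to pass stratify to train_test_split so that each subset keeps the class ratio. The sketch below uses a hypothetical balanced stand-in (20 active, 20 passive placeholder sentences), since the real class counts are not shown here.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical balanced stand-in for the real data: 20 active, 20 passive.
sentences = [f"sentence {i}" for i in range(40)]
labels = [0] * 20 + [1] * 20

# stratify keeps the Active/Passive ratio the same in both halves of a split.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    sentences, labels, test_size=0.4, random_state=42, stratify=labels)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp)

train_counts, val_counts, test_counts = Counter(y_tr), Counter(y_va), Counter(y_te)
print(train_counts, val_counts, test_counts)
```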

# 4. Initialize CountVectorizer and transform training and test data

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

In [13]:
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

#### The CountVectorizer converts the text data into document-term matrices (X_train_vec and X_test_vec).
##### After vectorization (fit_transform on X_train and transform on X_test), X_train_vec and X_test_vec are sparse matrices holding the token counts for the training and test data, respectively. The validation set (X_val) is not vectorized in this step.

#### Fit and Transform on Training Data (X_train): 
* ##### vectorizer.fit_transform(X_train) fits the vectorizer on the training data (X_train) and transforms it into a document-term matrix (X_train_vec). During the fitting process (fit_transform), the vocabulary of unique words (or tokens) is learned from the training data (X_train).
#### Transform Test Data (X_test):
* ##### vectorizer.transform(X_test) transforms the test data (X_test) using the vocabulary learned from the training data (X_train). This ensures that the test data (X_test) is represented using the same vocabulary (features) obtained from the training data (X_train), preserving consistency in the feature space.
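The effect of fitting on the training data only can be seen on a tiny example. The sentences below are hypothetical, hand-lemmatized stand-ins for the notebook's 'clean_sentence' values. Words that appear only in the test sentence ("novel", "write", "author") are not in the learned vocabulary, so transform ignores them.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical, hand-lemmatized sentences (not taken from the dataset).
train_demo = ["the chef prepare the meal", "the meal be prepare by the chef"]
test_demo = ["the novel be write by the author"]

vec = CountVectorizer()
train_matrix = vec.fit_transform(train_demo)  # learns the vocabulary
test_matrix = vec.transform(test_demo)        # reuses it, learns nothing new

vocab = sorted(vec.vocabulary_)           # columns are in alphabetical order
test_row = test_matrix.toarray().tolist()

print(vocab)
print(test_row)
```

The test row has one column per training-vocabulary token, so the classifier always receives the same number of features regardless of which words a new sentence contains.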

# 5. Using Logistic Regression Model (classifier) for binary classification.

In [14]:
from sklearn.linear_model import LogisticRegression

In [15]:
classifier = LogisticRegression()

* #### This initializes a logistic regression classifier named classifier. The LogisticRegression class is used for binary and multiclass classification problems.


In [16]:
classifier.fit(X_train_vec, y_train)

* #### The fit method trains the logistic regression classifier on the training data (X_train_vec) with the corresponding target labels (y_train). During training, the classifier learns the relationship between the input features and the target labels.
* #### Fitting is the core step of supervised learning: the model learns from the training data so that it can make predictions on unseen data (e.g., test data).
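The validation set created earlier is a natural place to compare model settings before touching the test set. The sketch below shows one way this could look: it tries a few values of the regularization strength C and keeps the one with the best validation accuracy. The sentences are hypothetical, hand-lemmatized stand-ins, and the candidate C values are arbitrary choices for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical, hand-lemmatized stand-ins for X_train/y_train and X_val/y_val.
train_text = [
    "the chef prepare the meal", "the teacher explain the lesson",
    "the kid play soccer", "the author write a novel",
    "the meal be prepare by the chef", "the lesson be explain by the teacher",
    "soccer be play by the kid", "a novel be write by the author",
]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = Active, 1 = Passive
val_text = ["the artist paint a portrait", "a portrait be paint by the artist"]
val_y = [0, 1]

vec = CountVectorizer()
X_tr = vec.fit_transform(train_text)
X_va = vec.transform(val_text)

# Score each candidate C on the validation set only.
scores = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C).fit(X_tr, train_y)
    scores[C] = model.score(X_va, val_y)

best_C = max(scores, key=scores.get)
print(scores, best_C)
```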

In [17]:
y_pred = classifier.predict(X_test_vec)
y_pred

array([0, 0, 0, 1, 1, 0, 1, 0], dtype=int64)

* #### The predict method is used to make predictions on the test data (X_test_vec) using the trained classifier (classifier).
* #### The y_pred array contains the predicted labels generated by the classifier for the test data (X_test_vec).

# 6. Finding Accuracy & Evaluating the performance using Classification Report

In [18]:
from sklearn.metrics import accuracy_score, classification_report

In [19]:
accuracy_score(y_test, y_pred)

1.0

* #### accuracy_score(y_test, y_pred) calculates the accuracy score by comparing the predicted labels (y_pred) with the true labels (y_test).

* #### An accuracy score of 1.0 means all predictions were correct.
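How much a perfect score says depends on the size of the test set. As a rough illustration, the Wilson score interval gives a plausible range for the true accuracy after 8 correct predictions out of 8. The helper name wilson_interval and the choice of z=1.96 (about 95% confidence) are assumptions made for this sketch.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = p + z**2 / (2 * total)
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return (centre - half) / denom, (centre + half) / denom

low, high = wilson_interval(8, 8)
print(round(low, 3), round(high, 3))
```

The lower bound comes out at roughly 0.68, so 8 out of 8 is compatible with a true accuracy well below 1.0.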

In [20]:
print(classification_report(y_test, y_pred, target_names=['Active', 'Passive']))

              precision    recall  f1-score   support

      Active       1.00      1.00      1.00         5
     Passive       1.00      1.00      1.00         3

    accuracy                           1.00         8
   macro avg       1.00      1.00      1.00         8
weighted avg       1.00      1.00      1.00         8



# 7. Creating A Function to check whether the given sentence is ACTIVE or PASSIVE

In [21]:
def predict_voice(sentence):
    cleaned_sentence = preprocess_text(sentence)
    sentence_vec = vectorizer.transform([cleaned_sentence])
    prediction = classifier.predict(sentence_vec)
    if prediction[0] == 0:
        return "Active"
    else:
        return "Passive"

* #### The predict_voice function takes a sentence as input, preprocesses it, vectorizes it with the fitted vectorizer, and then uses the trained classifier to predict whether the sentence is in "Active" or "Passive" voice.
* #### This function applies the trained model to new, unseen text.
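An alternative way to package the same steps is a scikit-learn Pipeline, which keeps the vectorizer and classifier together so they cannot get out of sync. The sketch below is a variation and not the notebook's method. It is trained on hypothetical, hand-lemmatized sentences and skips the spaCy step, so the helper predict_voice_demo expects text that is already lemmatized.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical, hand-lemmatized training sentences (0 = Active, 1 = Passive).
train_text = [
    "the chef prepare the meal", "the teacher explain the lesson",
    "the kid play soccer", "the author write a novel",
    "the meal be prepare by the chef", "the lesson be explain by the teacher",
    "soccer be play by the kid", "a novel be write by the author",
]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]

# The pipeline vectorizes and classifies in one fit/predict call.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_text, train_y)

LABELS = {0: "Active", 1: "Passive"}

def predict_voice_demo(lemmatized_sentence):
    """Return the predicted voice for an already-lemmatized sentence."""
    return LABELS[int(model.predict([lemmatized_sentence])[0])]

print(predict_voice_demo("the song be compose by the musician"))
```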

In [22]:
new_sentence = input("Enter the sentence: ")
predicted_voice = predict_voice(new_sentence)
print(f"The given sentence is in: {predicted_voice} Voice")

Enter the sentence:  The cat chased the mouse


The given sentence is in: Active Voice


#### User Input (input):
* ##### new_sentence = input("Enter the sentence: ") prompts the user to enter a sentence interactively. The entered sentence is stored in the new_sentence variable. 
#### Voice Prediction (predict_voice):
* ##### predict_voice(new_sentence) calls the predict_voice function with the user-provided new_sentence.
* ##### The function preprocesses, vectorizes, and predicts the voice type (either "Active" or "Passive") based on the trained model (vectorizer and classifier).
#### Display Prediction (print):
* ##### print(f"The given sentence is in: {predicted_voice} Voice") displays the predicted voice type ("Active" or "Passive") for the user-provided sentence.

# A detailed analysis of its strengths

#### 1. Accuracy on Validation/Test Data:
* The model reaches an accuracy of 1.0 on the held-out test set, correctly predicting the voice of all 8 sentences. This suggests that it can differentiate between active and passive voice on sentences similar to the training data, although 8 sentences is a small sample and the validation set was not scored.
  

#### 2. Precision and Recall:
* High precision suggests that the model makes accurate predictions when it identifies a sentence as either active or passive.
* High recall indicates that the model effectively captures all instances of a particular voice type (active or passive), minimizing false negatives.
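Precision and recall only differ from accuracy once the model makes mistakes, so a small worked case helps. The sketch below computes both by hand for the Passive class on hypothetical labels in which one passive sentence is missed; the function name precision_recall is illustrative.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels (0 = Active, 1 = Passive) with one missed passive.
y_true_demo = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred_demo = [0, 0, 0, 1, 0, 0, 1, 0]

precision, recall = precision_recall(y_true_demo, y_pred_demo)
print(precision, recall)
```

Here every sentence flagged as Passive really is passive, so precision stays at 1.0, while recall drops to 2/3 because one of the three passive sentences was labelled Active.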

#### 3. Interpretability:
* With a linear model such as logistic regression, interpretability is a strength: each token's coefficient shows how strongly it pushes a prediction toward active or passive voice.
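One way to inspect this is to pair each vocabulary token with its learned coefficient. The sketch below does so on hypothetical, hand-lemmatized sentences, so the exact weights will differ from those of the notebook's model. Positive weights push toward Passive (label 1).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical, hand-lemmatized training sentences (0 = Active, 1 = Passive).
train_text = [
    "the chef prepare the meal", "the teacher explain the lesson",
    "the kid play soccer", "the author write a novel",
    "the meal be prepare by the chef", "the lesson be explain by the teacher",
    "soccer be play by the kid", "a novel be write by the author",
]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]

vec = CountVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(train_text), train_y)

# Map each token to its coefficient; vocabulary_ gives the column index.
weights = {tok: clf.coef_[0][idx] for tok, idx in vec.vocabulary_.items()}
top_tokens = sorted(weights, key=weights.get, reverse=True)[:2]
print(top_tokens)
```

In this toy data the two highest-weighted tokens are "be" and "by", the markers that appear only in the passive sentences.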

#### 4. Scalability:
* CountVectorizer produces sparse matrices and logistic regression trains quickly on them, so the same approach can handle larger datasets and may be easier to deploy in production.