## TASK 1 - DATA EXPLORATION

In [8]:
# Task 1: Data Exploration

# Import necessary libraries
import pandas as pd

# Load the sample dataset
df = pd.read_excel('text_class.xlsx')

# Display the first 5 rows of the dataset
print("First 5 rows of the dataset:")
print(df.head())  # head() returns the first 5 rows by default

# Print total number of rows and count of unique labels
print("\nTotal number of rows:", len(df))
print("Count of unique labels:", df['label'].nunique())


First 5 rows of the dataset:
                                                text     label
0                 I loved the product, it's amazing!  positive
1    Terrible service, I will never shop here again.  negative
2    The quality is good, but the delivery was late.   neutral
3  Absolutely wonderful experience, highly recomm...  positive
4  Product was damaged when it arrived, very disa...  negative

Total number of rows: 8
Count of unique labels: 3


## TASK 2 - PREPROCESSING

In [3]:
import nltk

# Download the tokenizer models and the stopword list used below
nltk.download('punkt_tab')
nltk.download('stopwords')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\devir_jnfy7nx\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\devir_jnfy7nx\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [5]:

import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk



# Build the stopword set once instead of on every call
stop_words = set(stopwords.words('english'))

# Define a function to clean and preprocess text
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation, one character at a time
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    # Tokenize text into words
    tokens = word_tokenize(text)
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    return " ".join(tokens)

# Apply preprocessing to every row of the text column
df['processed_text'] = df['text'].apply(preprocess_text)

# Display the first 5 rows of processed data
print("\nProcessed top 5 rows:")
print(df[['text', 'processed_text']].head())



Processed top 5 rows:
                                                text  \
0                 I loved the product, it's amazing!   
1    Terrible service, I will never shop here again.   
2    The quality is good, but the delivery was late.   
3  Absolutely wonderful experience, highly recomm...   
4  Product was damaged when it arrived, very disa...   

                                     processed_text  
0                             loved product amazing  
1                       terrible service never shop  
2                        quality good delivery late  
3  absolutely wonderful experience highly recommend  
4              product damaged arrived disappointed  


## Explanation:
Preprocessing Steps:

Lowercasing: text.lower() converts all text to lowercase to ensure uniformity.

Removing Punctuation: We loop through the characters of the text and keep only those that are not in string.punctuation. This also strips apostrophes, so "it's" becomes "its".

Tokenization: word_tokenize(text) splits the text into individual words.

Removing Stopwords: We remove common words (stopwords) like "the", "is", "in", etc., using NLTK's stopwords.words('english').

Applying Preprocessing: df['text'].apply(preprocess_text) runs the function on every row of the text column and stores the result in a new processed_text column.
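
The steps above can also be sketched without NLTK, which makes each stage easy to inspect. This is only an illustration: `STOP_WORDS` is a tiny hand-written set standing in for NLTK's much larger English list, a plain whitespace split stands in for `word_tokenize`, and the name `preprocess_sketch` is hypothetical.

```python
import string

# Hypothetical, deliberately tiny stopword set (assumption);
# NLTK's English list is much larger.
STOP_WORDS = {"i", "the", "it", "its", "is", "but", "was", "will", "here", "again"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess_sketch(text):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    cleaned = text.lower().translate(PUNCT_TABLE)
    tokens = cleaned.split()  # whitespace split stands in for word_tokenize
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)

print(preprocess_sketch("I loved the product, it's amazing!"))
# loved product amazing
```

For the first two sample rows this reproduces the processed_text values shown in the output above, but a real stopword list and tokenizer will differ on other inputs.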



## TASK 3 - TRAIN A CLASSIFIER

In [6]:


# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Split the data into training and test sets (80% training, 20% testing)
X = df['processed_text']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert text to features using TF-IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train a simple Logistic Regression model
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)

# Predict the labels on the test set
y_pred = model.predict(X_test_tfidf)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("\nAccuracy of the Logistic Regression model:", accuracy)



Accuracy of the Logistic Regression model: 0.5


## Explanation:
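Data Splitting: The train_test_split() function splits the data into training (80%) and test (20%) sets. With 8 rows, that leaves 6 rows for training and 2 for testing. Setting random_state=42 makes the split reproducible.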

Feature Extraction (TF-IDF):

The TfidfVectorizer() converts the text data into numerical features (TF-IDF values).

fit_transform() is applied to the training data to learn the vocabulary and term weights, and transform() is applied to the test data so that it is encoded with the same learned vocabulary. Words that appear only in the test data are ignored.

Logistic Regression: A LogisticRegression() model with default settings is trained on the TF-IDF features of the training data.

Accuracy Calculation: The accuracy is calculated using accuracy_score() by comparing the predicted labels (y_pred) with the actual labels (y_test).
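
The split between fitting on the training data and only transforming the test data can be illustrated with a small pure-Python sketch. The class `TinyTfidf` and the sample sentences are hypothetical. The IDF formula is the smoothed form that scikit-learn documents as its default, but the sketch omits the L2 normalization that TfidfVectorizer applies, so the numbers will not match scikit-learn's output.

```python
import math
from collections import Counter

class TinyTfidf:
    """Illustrative TF-IDF (assumption: simplified, no L2 normalization)."""

    def fit(self, docs):
        n_docs = len(docs)
        # Document frequency: number of documents containing each term.
        doc_freq = Counter(term for doc in docs for term in set(doc.split()))
        self.vocab_ = sorted(doc_freq)
        # Smoothed inverse document frequency.
        self.idf_ = {
            term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
            for term in self.vocab_
        }
        return self

    def transform(self, docs):
        rows = []
        for doc in docs:
            counts = Counter(doc.split())
            # Only vocabulary learned in fit() gets a column; unseen words are dropped.
            rows.append([counts[term] * self.idf_[term] for term in self.vocab_])
        return rows

train_docs = ["loved product amazing", "terrible service never shop"]
test_docs = ["amazing service refund"]  # "refund" never appears in training

vectorizer_sketch = TinyTfidf().fit(train_docs)
test_rows = vectorizer_sketch.transform(test_docs)

print(vectorizer_sketch.vocab_)
print(test_rows[0])
```

The test row has one column per training-vocabulary word, and "refund" contributes nothing because it was never seen during fitting. Fitting on the test data as well would leak information from the test set into the features.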

## TASK 4 - MODEL EVALUATION

In [7]:


# Import necessary libraries for confusion matrix
from sklearn.metrics import confusion_matrix

# Evaluate the performance using confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(conf_matrix)

# Brief comment on the confusion matrix:
print("\nThe confusion matrix provides insights into how well the model is performing by showing the number of true positives, true negatives, false positives, and false negatives. This allows us to understand where the model is making errors and how accurate it is in predicting each class.")



Confusion Matrix:
[[1 0]
 [1 0]]

The confusion matrix provides insights into how well the model is performing by showing the number of true positives, true negatives, false positives, and false negatives. This allows us to understand where the model is making errors and how accurate it is in predicting each class.
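
To make the layout of the matrix concrete, here is a minimal pure-Python sketch that counts (true label, predicted label) pairs, with rows for true labels and columns for predicted labels. The label values below are hypothetical: the notebook output does not show which two classes ended up in the test set, so they are chosen only to reproduce the same shape of result. The helper name `confusion_counts` is also an assumption.

```python
def confusion_counts(y_true, y_pred):
    """Return (labels, matrix) where matrix[i][j] counts true=labels[i], pred=labels[j]."""
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true_label, pred_label in zip(y_true, y_pred):
        matrix[index[true_label]][index[pred_label]] += 1
    return labels, matrix

# Hypothetical test labels and predictions (assumption, not taken from the dataset).
y_true_example = ["negative", "positive"]
y_pred_example = ["negative", "negative"]

labels, matrix = confusion_counts(y_true_example, y_pred_example)

# Accuracy is the sum of the diagonal divided by the total number of samples.
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / len(y_true_example)

print(labels)    # ['negative', 'positive']
print(matrix)    # [[1, 0], [1, 0]]
print(accuracy)  # 0.5
```

Read this way, a matrix of [[1 0], [1 0]] means both test samples were predicted as the first class: one correctly (the diagonal entry) and one incorrectly (the off-diagonal entry in the second row).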


## Final Thoughts and Insights:
Model Evaluation: The model reached an accuracy of 0.5, and the confusion matrix shows that it assigned both test samples to the same class, getting one right and one wrong. A result like this suggests exploring improvements such as more data, different models, or hyperparameter tuning.

Dataset Limitation: With only 8 samples, the test set contains just 2 rows, so the accuracy can only be 0, 0.5, or 1 and is not a reliable estimate of performance. A larger dataset would allow a more trustworthy evaluation and is likely to improve the model.
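
The arithmetic behind this limitation is short enough to write out. This is a minimal sketch, assuming that a fractional test_size is rounded up to a whole number of test rows (which is my understanding of how train_test_split handles a float test_size); the variable names are illustrative.

```python
import math

n_samples = 8      # total rows in the dataset
test_fraction = 0.2

# Assumption: the test count is rounded up, and the rest goes to training.
n_test = math.ceil(test_fraction * n_samples)
n_train = n_samples - n_test

# With n_test samples, accuracy can only take n_test + 1 distinct values.
possible_accuracies = [k / n_test for k in range(n_test + 1)]

print(n_train, n_test)        # 6 2
print(possible_accuracies)    # [0.0, 0.5, 1.0]
```

Each test sample therefore moves the accuracy by 50 percentage points, which is why a single score on this split says little about how the model would perform on new data.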