<a href="https://colab.research.google.com/github/am88tech/gen-ai-ml/blob/main/notebook/assignment/Word2Vec_Assignment_Questions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Step-1 : Import the data
https://drive.google.com/file/d/1vZ4S0dtiUk5LqeeccqWs9IAAG8qH1GWv/view?usp=sharing

it will be a zipped file, unzip News_Category_Dataset.zip



1. **Download the Dataset**:
   - Use the `gdown` command with the `--id` parameter to download the dataset from Google Drive. Replace `1vZ4S0dtiUk5LqeeccqWs9IAAG8qH1GWv` with the specific ID of your file on Google Drive to ensure the correct file is downloaded.

2. **Unzip the Dataset**:
   - After downloading, unzip the dataset using the `!unzip` command followed by the filename. In this case, the file to unzip is `News_Category_Dataset.zip`. This step will extract the JSON file needed for data loading.

3. **Load the Dataset**:
   - Import the `pandas` library (if not already imported) using `import pandas as pd` to handle the dataset. Load the JSON file using `pd.read_json`, specifying the path `/content/News_Category_Dataset_v3.json` and setting `lines=True` to correctly format the dataset as each JSON object is stored on a separate line.

4. **Select Relevant Columns**:
   - From the loaded dataset, select only the columns 'headline' and 'category' for further analysis. Ensure that any missing values in these columns are removed by using the `.dropna()` method. This will help in maintaining the quality and consistency of the data being analyzed.

5. **Preprocess Text Data**:
   - Convert the text in the 'headline' column to string type to standardize the format for textual analysis. This is achieved by applying `.astype(str)` to the 'headline' column, which ensures that all entries are treated as strings.

6. **Filter and Display Selected Categories**:
   - Define a list of categories of interest (e.g., 'POLITICS', 'ENTERTAINMENT', 'BUSINESS', 'SPORTS'). Filter the dataset to include only these categories by checking if the 'category' column values are in the predefined list `top_categories`.
   - Display the first few headlines and the count of entries per category in the filtered dataset to verify the filtering process and to get a preliminary view of the data distribution among these top categories. Use `print(data["headline"].head())` and `print(data['category'].value_counts())`.



In [5]:

!gdown --id 1vZ4S0dtiUk5LqeeccqWs9IAAG8qH1GWv
!unzip News_Category_Dataset.zip


Downloading...
From (original): https://drive.google.com/uc?id=1vZ4S0dtiUk5LqeeccqWs9IAAG8qH1GWv
From (redirected): https://drive.google.com/uc?id=1vZ4S0dtiUk5LqeeccqWs9IAAG8qH1GWv&confirm=t&uuid=15b877e4-c07e-4d4e-b97b-17de6678364e
To: /content/News_Category_Dataset.zip
100% 27.8M/27.8M [00:00<00:00, 69.2MB/s]
Archive:  News_Category_Dataset.zip
replace News_Category_Dataset_v3.json? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

In [None]:
import pandas as pd
data = pd.read_json("/content/News_Category_Dataset_v3.json", lines=True)
# Select relevant columns and remove missing values
data = data[['headline', 'category']].dropna()

# Convert 'headline' column to string type
data['headline'] = data['headline'].astype(str)

# Define top categories
top_categories = ['POLITICS', 'ENTERTAINMENT', 'BUSINESS', 'SPORTS']

# Filter data based on top categories
data = data[data['category'].isin(top_categories)]

# Display first few headlines and category counts
print(data["headline"].head())
print(data['category'].value_counts())

In [None]:
'''
Final Expected Output

category
POLITICS         35602
ENTERTAINMENT    17362
BUSINESS          5992
SPORTS            5077
'''

#Step-2: Pre-process, Tokenize and Prepare Train and Test data

1. **Initialize the Tokenizer**:
   - Start by creating an instance of the `Tokenizer` from the TensorFlow library. This tokenizer will be used to convert text data into sequences of integers, which are more manageable for model processing.

2. **Fit the Tokenizer**:
   - Fit the tokenizer on the 'processed_text' column of your dataset. This step allows the tokenizer to learn and map the vocabulary of your texts, essential for transforming text into a numerical format.

3. **Convert Text to Sequences**:
   - Use the tokenizer to convert the texts in the 'processed_text' column into sequences. Each text will be transformed into a sequence of integers where each integer represents a unique word in the learned vocabulary.

4. **Set Sequence Length**:
   - Define a maximum sequence length (100 in this case) to standardize the size of the input data. This helps in handling variability in text length across your dataset.

5. **Pad Sequences**:
   - Adjust the sequences to a consistent length using `pad_sequences`. This function will truncate sequences longer than the maximum length and pad shorter ones with zeros. The result is a uniform input shape for modeling, stored in variable `X`.

6. **Encode Labels and Split Data**:
   - Convert categorical labels in the 'category' column into numerical form using `LabelEncoder`, making them suitable for training a TensorFlow model. Then, split the dataset into training and testing sets with the `train_test_split` method, using 20% of the data for testing, ensuring that your model is trained and evaluated on different subsets of data.

In [5]:
#Write your code here
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split


tokenizer = Tokenizer()
tokenizer.fit_on_texts(data['headline'])
sequences = tokenizer.texts_to_sequences(data['headline'])

max_length = 100  # Maximum length of a complaint narrative
X = pad_sequences(sequences, maxlen=max_length)
y = data['headline']

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.3, random_state=42)

In [None]:
'''
Final Expected Output

Shape of X_train: (51226, 100)
Shape of y_train: (51226,)
Shape of X_test: (12807, 100)
Shape of y_test: (12807,)
'''

#Step 3: Configure the model


1. **Calculate Vocabulary Size**:
   - Determine the size of the vocabulary by counting the total number of unique words in the text data, which is obtained from the `tokenizer.word_index`. Add one to this count to account for the zero index in TensorFlow.

2. **Initialize the Sequential Model**:
   - Create an instance of the `Sequential` model from TensorFlow's Keras API. This sets up a linear stack of layers in the neural network, to which you will add different types of layers.

3. **Add Embedding Layer**:
   - Insert an `Embedding` layer first in the model to convert integer sequences (tokens) into dense vectors of fixed size. Set the `input_dim` to the vocabulary size, `output_dim` to 16/32/64 to define the vector space dimensionality, and `input_length` to the maximum length of input sequences.

4. **Incorporate GlobalAveragePooling1D Layer**:
   - Include a `GlobalAveragePooling1D` layer following the embedding layer. This layer reduces the dimensionality of the model by calculating the average output of each dimension across the sequence, which helps in minimizing overfitting.

5. **Add Dense Hidden Layer**:
   - Add a `Dense` layer with 32/64/128 neurons, using the 'ReLU' activation function. This layer serves as the hidden layer and provides the model with the ability to learn non-linear relationships in the data.

6. **Configure Output Layer**:
   - Finally, add another `Dense` layer, this time with the number of units equal to the number of unique labels in your dataset, using 'softmax' activation. This layer will output a probability distribution over the class labels, making it suitable for multi-class classification. End by printing the model summary to review the architecture and parameters of your network.



In [4]:
#Write your code here
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense

vocab_size = len(tokenizer.word_index) + 1  # Vocabulary size
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=32, input_length=max_length))
model.add(GlobalAveragePooling1D())
model.add(Dense(64, activation='relu'))
model.add(Dense(len(label_encoder.classes_), activation='softmax'))

NameError: name 'tokenizer' is not defined

In [None]:
'''
Final Expected Output

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 embedding_2 (Embedding)     (None, 100, 32)           1338496

 global_average_pooling1d_1  (None, 32)                0
  (GlobalAveragePooling1D)

 dense_2 (Dense)             (None, 64)                2112

 dense_3 (Dense)             (None, 4)                 260

=================================================================
Total params: 1340868 (5.12 MB)
Trainable params: 1340868 (5.12 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
'''

# Step4- Compile and Train the Model

1. **Compile the Model**:
   - Set up your model for training by compiling it with the necessary configurations: use 'adam' as the optimizer for its efficiency in handling sparse gradients and adaptive learning rate capabilities; choose 'sparse_categorical_crossentropy' as the loss function suited for multi-class classification tasks where labels are integers; and select 'accuracy' as the metric to monitor the model's performance during training.

2. **Configure Training Parameters**:
   - Specify the parameters for training the model: `epochs` defines how many times the model will work through the entire training dataset; `batch_size` determines the number of samples to work through before updating the internal model parameters; use `validation_data` to provide the test dataset for evaluating the model after each epoch.

3. **Start Model Training**:
   - Begin the training process by calling the `model.fit` method with your training dataset (`X_train` and `y_train`), along with the number of epochs, batch size, and validation data. This process iteratively adjusts the model weights to minimize the loss and improve accuracy on the training data.

4. **Monitor Training Progress**:
   - Observe the output during training to monitor progress. This output includes loss and accuracy metrics for both training and validation sets, providing insight into how well the model is learning and generalizing to new data.

5. **Save Model Weights**:
   - After training, save the model weights to a file using `model.save_weights('model.h5')`. This allows the trained model configuration to be preserved, which can be useful for deployment or further evaluation without needing to retrain.

6. **Reload Model Weights**:
   - If needed, reload the model weights from the saved file with `model.load_weights('model.h5')` to resume training, make predictions, or perform evaluations. This step ensures that the model's state can be restored or transferred without loss of fidelity.


In [7]:
#Write your code here
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=5, batch_size=128, validation_data=(X_test, y_test))

Epoch 1/5
[1m351/351[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m140s[0m 393ms/step - accuracy: 0.0010 - loss: 11.0792 - val_accuracy: 0.0014 - val_loss: 11.1910
Epoch 2/5
[1m351/351[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m119s[0m 328ms/step - accuracy: 0.0011 - loss: 11.0015 - val_accuracy: 0.0014 - val_loss: 11.4208
Epoch 3/5
[1m351/351[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m161s[0m 383ms/step - accuracy: 0.0015 - loss: 10.9364 - val_accuracy: 0.0014 - val_loss: 11.6400
Epoch 4/5
[1m351/351[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m138s[0m 373ms/step - accuracy: 0.0013 - loss: 10.8881 - val_accuracy: 0.0014 - val_loss: 11.8485
Epoch 5/5
[1m351/351[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m133s[0m 380ms/step - accuracy: 0.0015 - loss: 10.8535 - val_accuracy: 0.0014 - val_loss: 12.0473


<keras.src.callbacks.history.History at 0x7b46480bd650>

In [3]:
# Save the model weights
model.save_weights('complaints_model.weights.h5')

# Load the model weights
model.load_weights('complaints_model.weights.h5')


NameError: name 'model' is not defined

In [None]:
'''
Final Expected Output

Epoch 1/5
401/401 [==============================] - 9s 20ms/step - loss: 1.0822 - accuracy: 0.5611 - val_loss: 0.9576 - val_accuracy: 0.6851
Epoch 2/5
401/401 [==============================] - 9s 24ms/step - loss: 0.7040 - accuracy: 0.7567 - val_loss: 0.5884 - val_accuracy: 0.7765
Epoch 3/5
401/401 [==============================] - 8s 19ms/step - loss: 0.4570 - accuracy: 0.8308 - val_loss: 0.4467 - val_accuracy: 0.8428
Epoch 4/5
401/401 [==============================] - 10s 26ms/step - loss: 0.3348 - accuracy: 0.8898 - val_loss: 0.4037 - val_accuracy: 0.8565
Epoch 5/5
401/401 [==============================] - 9s 23ms/step - loss: 0.2668 - accuracy: 0.8870 - val_loss: 0.3747 - val_accuracy: 0.8709
'''

# Step 5: Evaluate the Model


1. **Generate Predictions**:
   - Use the `model.predict` method on `X_test` to obtain the probabilities for each class. Apply `np.argmax` with `axis=1` to convert these probabilities into actual class predictions, `y_pred`, which indicates the class with the highest probability for each test sample.

2. **Import and Compute the Confusion Matrix**:
   - Import the `confusion_matrix` function from `sklearn.metrics`. Then, calculate the confusion matrix using the true labels (`y_test`) and the predicted labels (`y_pred`). This matrix will help visualize the accuracy of the predictions across different classes, showing the number of correct and incorrect predictions for each class.

3. **Calculate and Display Accuracy**:
   - Compute the overall accuracy of the model by comparing `y_test` and `y_pred` using the `accuracy_score` function. Display this value to understand the proportion of correctly predicted instances among the total instances in the test set.

4. **Print the Classification Report**:
   - Use the `classification_report` function from `sklearn.metrics` to generate a detailed classification report. This report includes metrics such as precision, recall, and F1-score for each class, which are crucial for assessing model performance, especially in multi-class classification tasks.

5. **Display Class Names in Reports**:
   - Provide the `target_names` parameter with class labels from `label_encoder.classes_` to the `classification_report` function. This makes the report more readable and informative by displaying the actual names of the classes instead of numerical labels.

6. **Review Model Performance**:
   - Examine the confusion matrix and the classification report printed in the console to review how well the model performs on different classes. Use this analysis to identify any biases or weaknesses in the model, such as consistently misclassified classes, which could guide further refinement and improvements.



In [1]:
# Write your code here
from sklearn.metrics import accuracy_score, classification_report
import numpy as np


y_pred = np.argmax(model.predict(X_test), axis=1)

#Confusion Matrix
cm= tf.math.confusion_matrix(y_test, y_pred)
print(cm)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

print("Classification Report:")
report = classification_report(y_test, y_pred, target_names=label_encoder.classes_)
print(report)

NameError: name 'model' is not defined

In [None]:
'''
Expected Output

Confusion Matrix:
[[ 675   85  461   70]
 [  33 3140  256   41]
 [ 122  242 6661   15]
 [  90  161   77  678]]
Accuracy: 0.8709299601780276
Classification Report:
               precision    recall  f1-score   support

     BUSINESS       0.73      0.52      0.61      1291
ENTERTAINMENT       0.87      0.90      0.88      3470
     POLITICS       0.89      0.95      0.92      7040
       SPORTS       0.84      0.67      0.75      1006

     accuracy                           0.87     12807
    macro avg       0.83      0.76      0.79     12807
 weighted avg       0.87      0.87      0.87     12807
 '''

# Step-6 Making a prediction on new news articles

In [None]:
news_new = [
    """
    LOS ANGELES -- With the bases loaded and two outs against one of baseball’s
    nastiest relievers, MJ Melendez fought off pitch after pitch … after pitch after pitch … to
    keep the at-bat alive in hopes of coming through in the Royals’
    best scoring opportunity on Saturday night.
    """,
    """
    Biden campaign rakes in $28 million for star-studded Los Angeles fundraiser
    The massive haul was announced just hours before President Joe Biden appeared
    alongside former President Barack Obama, George Clooney and others.
    """
]

1. **Prepare New Input Data**:
   - Create a list of new news articles, `news_new`, where each entry is a string containing the text of the news article. This example includes articles about a baseball game and a political fundraiser.

2. **Tokenize New Texts**:
   - Use the `tokenizer` that was previously fitted on your training data to convert the new news texts (`news_new`) into sequences of integers. This process transforms the raw text into a format that the neural network can process.

3. **Pad Sequences**:
   - Pad the newly created sequences (`new_sequences`) to ensure they all have the same length, `max_length`, as defined during the training process. Use the `pad_sequences` function, setting `maxlen` to `max_length`. This standardization is necessary for consistent input size into the neural network.

4. **Predict Class Probabilities**:
   - Employ the trained model to predict the class probabilities for the padded sequences (`new_X`). The `model.predict` function will output a list of probabilities for each class for each article.

5. **Determine Predicted Classes**:
   - Extract the predicted class indices by finding the index of the maximum probability in each set of predictions. This is achieved using `np.argmax` across `axis=1` of `new_predictions`, resulting in a list of the most likely class indices for each article.

6. **Translate Indices to Labels**:
   - Convert the predicted class indices (`pred_class`) back into readable class labels using the `label_encoder`'s `inverse_transform` method. This step maps the numerical indices back to their corresponding categorical labels.
   - Finally, print both the predicted class indices and their corresponding labels to see the classification results for the new articles.



In [None]:
new_sequences = tokenizer.texts_to_sequences(news_new)
new_X = pad_sequences(new_sequences, maxlen=max_length)
new_predictions = model.predict(new_X)
pred_class=np.argmax(new_predictions, axis=1)
print(pred_class)

In [None]:
'''
Expected output
[3 2]
['SPORTS' 'POLITICS']
'''