
In [None]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Sample unstructured text data
unstructured_data = [
    "Berkshire retains and reinvests earnings when doing so delivers",
    "at least proportional increases in per share market value over time.",
    "It uses debt sparingly and sells equity only when",
    "it receives as much in value as it gives.",
    "Buffett penetrates accounting conventions, especially",
    "those that obscure real economic earnings.",
]

# Convert text data into numerical features using TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer()
X = tfidf_vectorizer.fit_transform(unstructured_data)

# Apply clustering to generate pseudo-labels
num_clusters = 2  # Assume we categorize data into two classes
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
pseudo_labels = kmeans.fit_predict(X)

# Create a structured dataset
df = pd.DataFrame({'Text': unstructured_data, 'Label': pseudo_labels})
print("Structured Data:")
print(df)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, pseudo_labels, test_size=0.2, random_state=42)

# Train a supervised model (Random Forest Classifier)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Predict labels for the test set
y_pred = clf.predict(X_test)

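# Evaluate the model against the held-out pseudo-labels
# (this measures agreement with the K-Means clusters, not ground-truth accuracy)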
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Predict labels for new unstructured data
new_data = ["Bufffet penetrates", "obscure tells u something insignificant paper"]
new_X = tfidf_vectorizer.transform(new_data)
new_predictions = clf.predict(new_X)

# Display predictions
for text, label in zip(new_data, new_predictions):
    print(f"Text: {text} => Predicted Label: {label}")


Structured Data:
                                                Text  Label
0  Berkshire retains and reinvests earnings when ...      0
1  at least proportional increases in per share m...      0
2  It uses debt sparingly and sells equity only when      0
3          it receives as much in value as it gives.      0
4  Buffett penetrates accounting conventions, esp...      0
5         those that obscure real economic earnings.      1
Model Accuracy: 1.00
Text: Bufffet penetrates => Predicted Label: 0
Text: obscure tells u something insignificant paper => Predicted Label: 0


**Explanation:**

1. **Data Preparation:**
   - **Import libraries:** Import pandas and the scikit-learn components the code uses: `TfidfVectorizer`, `KMeans`, `train_test_split`, `RandomForestClassifier`, and `accuracy_score`.
   - **Sample data:** Create a list of unstructured text data (replace with your actual data).
   - **Text preprocessing:** The code defines no separate cleaning function. With its default settings, `TfidfVectorizer` lowercases the text and drops punctuation during tokenization.
   - **Feature extraction:** Use TF-IDF vectorizer to convert text into numerical features.

2. **Labeling:**
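   - **Pseudo-labeling:** No manual labels are available, so K-Means clusters the TF-IDF vectors into two groups and the cluster IDs serve as pseudo-labels. The IDs are arbitrary: they carry no meaning (e.g., positive/negative sentiment) until someone inspects the clusters and names them.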

3. **Model Training:**
   - **Data splitting:** Split your data into training and testing sets.
   - **Model selection:** The code uses a Random Forest classifier; another supervised algorithm, such as Logistic Regression, could be substituted.
   - **Training:** Train the model using the training data.

4. **Prediction on New Data:**
   - **Preprocess new data:** Convert new data with the already fitted vectorizer (`transform`, not `fit_transform`) so it is mapped onto the same vocabulary as the training data.
   - **Prediction:** Use the trained model to predict labels for the new data.
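
The preprocessing and new-data steps above can be illustrated with a minimal sketch. It reuses two sentences from the sample data; the second new text (`"It is raining"`) is a made-up input added only for illustration. The sketch assumes the default `TfidfVectorizer` settings, under which text is lowercased and tokens not seen during fitting are silently ignored at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two sentences taken from the sample data above
corpus = [
    "Buffett penetrates accounting conventions, especially",
    "those that obscure real economic earnings.",
]

# Default settings: lowercase=True, punctuation dropped by the tokenizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# The learned vocabulary holds lowercased tokens without punctuation
print(sorted(vectorizer.vocabulary_))

# New text is mapped onto the SAME vocabulary with transform().
# "Bufffet" (misspelled) was never seen, so only "penetrates" contributes.
# "It is raining" is a hypothetical input with no known tokens at all.
new_X = vectorizer.transform(["Bufffet penetrates", "It is raining"])
print(new_X.shape)     # (number of new texts, vocabulary size)
print(new_X[0].nnz)    # count of known tokens in the first text
print(new_X[1].nnz)    # count of known tokens in the second text
```

A row with no known tokens is an all-zero vector, so the classifier's prediction for it does not depend on the text at all. This is worth checking before trusting predictions on short or misspelled inputs.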

**Key points:**

* **Data Cleaning and Feature Engineering:** The most crucial step is to properly clean and transform unstructured data into meaningful features.
* **Labeling Strategy:** Depending on your data, you may need to develop a more sophisticated labeling strategy, possibly involving domain experts or active learning techniques.
* **Model Selection:** Choose a supervised learning algorithm suitable for your task (classification, regression) and data characteristics.
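* **Evaluation:** Always evaluate your model with metrics such as accuracy, precision, and recall. In this example the score is computed against K-Means pseudo-labels on a tiny test set, so an accuracy of 1.00 shows agreement with the clustering, not real-world correctness.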
* **Real-world considerations:** For real-world datasets, consider data augmentation (especially when labeled examples are scarce), advanced feature engineering, and hyperparameter tuning to improve model performance.
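
As a small illustration of the evaluation point, the sketch below computes accuracy, precision, and recall with scikit-learn. The `y_true` and `y_pred` lists are hypothetical values chosen for illustration; they are not produced by the notebook's model.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: class 1 is rare, as with the pseudo-labels above
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]  # one class-1 sample is missed

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning if a class is never predicted
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
```

With imbalanced labels, accuracy can stay high while recall on the rare class is poor, which is why reporting more than one metric is useful.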

**Important Note:** This example uses a simple text classification scenario. Adapt the preprocessing, feature engineering, and model selection based on the specific type of unstructured data you are working with.
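
One possible way to make such adaptation easier is to wrap the vectorizer and the classifier in a scikit-learn pipeline, so that either stage can be swapped without touching the rest of the code. This is a sketch, not the notebook's method: the labels are copied from the printed K-Means output above, and the two texts passed to `predict` are made up for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

texts = [
    "Berkshire retains and reinvests earnings when doing so delivers",
    "at least proportional increases in per share market value over time.",
    "It uses debt sparingly and sells equity only when",
    "it receives as much in value as it gives.",
    "Buffett penetrates accounting conventions, especially",
    "those that obscure real economic earnings.",
]
# Pseudo-labels as printed in the "Structured Data" output above
labels = [0, 0, 0, 0, 0, 1]

# The pipeline fits the vectorizer and the classifier together, and
# applies the same fitted vectorizer automatically at prediction time.
model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
model.fit(texts, labels)

# Hypothetical new inputs; raw strings go straight into predict()
preds = model.predict(["real economic earnings", "debt and equity"])
print(preds)
```

Replacing `TfidfVectorizer()` or `RandomForestClassifier(...)` with another transformer or estimator changes one line, and the new-data path stays consistent with training.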