<a href="https://colab.research.google.com/github/don-06don/MOOCs-Dataset/blob/main/01.MOOCs_Urgency_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **1️⃣ Install Libraries & Mount Google Drive**

In this first step, we import the libraries used throughout the project and mount Google Drive to read the dataset stored there.
This gives us access to the data file and the tools required for preprocessing, feature extraction, and modeling.

In [2]:
import pandas as pd
from google.colab import drive
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from scipy.sparse import hstack
from sklearn.metrics import accuracy_score, classification_report
from imblearn.over_sampling import SMOTE
drive.mount('/content/drive')
df = pd.read_excel('/content/drive/MyDrive/moocs.xlsx')

Mounted at /content/drive


# **2️⃣ Create Binary Urgency Column**

The original "Urgency(1-7)" column contains scores from 1 (low urgency) to 7 (high urgency), in steps of 0.5.
For simplicity, we convert it into a binary column:

 - 0 → low urgency (values < 4)

 - 1 → high urgency (values ≥ 4)

This lets us treat the task as binary classification.

In [3]:
# Vectorized threshold: scores of 4 or more become 1 (high urgency), all others 0 (low urgency).
# This replaces a row-by-row loop over df.loc and gives the same result on this column.
df["Urgency_binary"] = (df["Urgency(1-7)"] >= 4).astype(int)

print("Before: values in Urgency(1-7):", df["Urgency(1-7)"].unique()) #the old col.
print(df["Urgency(1-7)"].value_counts())

print("After: Unique values:", df["Urgency_binary"].unique()) #the new col.
print(df["Urgency_binary"].value_counts())

Before: values in Urgency(1-7): [1.5 3.5 2.5 3.  5.5 4.5 5.  1.  2.  6.5 4.  6.  7. ]
Urgency(1-7)
2.0    6427
2.5    4624
1.5    3946
1.0    3501
3.0    3308
5.0    2259
5.5    1990
3.5    1380
4.5     862
4.0     812
6.0     415
6.5      66
7.0      14
Name: count, dtype: int64
After: Unique values: [0 1]
Urgency_binary
0    23186
1     6418
Name: count, dtype: int64


# **3️⃣ Handle Missing Values**

We handle missing values to prevent errors in further processing:

 - Drop rows with missing Text

 - Fill missing post_type with the most frequent category

 - Infer missing CourseType from course_display_name (the part before the first "/")

In [4]:
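# Drop rows that have no post text
df = df.dropna(subset=["Text"])
# fillna returns a new Series, so the result must be assigned back to the column
df["post_type"] = df["post_type"].fillna(df["post_type"].mode()[0])
# Use the first path segment of course_display_name when CourseType is missing
df["CourseType"] = df["CourseType"].fillna(df["course_display_name"].str.split("/").str[0])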

In [12]:
print("Missing in course type:", df["CourseType"].isna().sum())
print("Missing in post type:", df["post_type"].isna().sum())
print("Missing in text:", df["Text"].isna().sum())

Missing in course type: 0
Missing in post type: 0
Missing in text: 0
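
The three rules above can be checked on a small frame before running them on the full dataset. The sketch below is only illustrative: the column names match the notebook, but the rows and the course_display_name values are made up for the example.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the notebook.
toy = pd.DataFrame({
    "Text": ["When is the deadline?", None, "Great lecture"],
    "post_type": ["Comment", "Comment", None],
    "CourseType": [None, "Education", "Medicine"],
    "course_display_name": [
        "Humanities/CourseA/Winter",
        "Education/CourseB/Fall",
        "Medicine/CourseC/Spring",
    ],
})

# Rule 1: drop rows with no text (removes the second row)
toy = toy.dropna(subset=["Text"]).copy()
# Rule 2: fill post_type with the most frequent category, assigning the result back
toy["post_type"] = toy["post_type"].fillna(toy["post_type"].mode()[0])
# Rule 3: infer CourseType from the segment before the first "/"
toy["CourseType"] = toy["CourseType"].fillna(
    toy["course_display_name"].str.split("/").str[0]
)

missing_after = int(toy.isna().sum().sum())
print(toy[["post_type", "CourseType"]])
print("Missing values left:", missing_after)
```

Running the check on a toy frame makes it easy to spot a fill that was computed but never assigned back to the column.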


# **4️⃣ Encode Categorical Variables**

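Machine learning models require numeric inputs, so categorical variables must be converted into numbers.
In this step, we transform the columns CourseType and post_type using LabelEncoder.

Why LabelEncoder?

 - It assigns a unique integer to each category.

 - It keeps each categorical column as a single numeric feature, so it combines easily with the other features.

 - Caveat: both columns are nominal (no inherent order), but integer codes imply one, and distance-based models like KNN treat that order as meaningful. One-hot encoding is a common alternative that avoids this.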

In [8]:
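categorical_cols = ["CourseType", "post_type"]
encoders = {}  # one fitted encoder per column, so codes can be mapped back to labels later
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col].astype(str))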
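
To see what the encoder produces, here is a small sketch on made-up post_type values (the category names are assumptions, not taken from the dataset). It also shows one-hot encoding with `pd.get_dummies` as a possible alternative; that alternative is not what the notebook uses.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical category values, for illustration only
post_types = pd.Series(["Comment", "CommentThread", "Comment", "Reply"])

le_demo = LabelEncoder()
codes = le_demo.fit_transform(post_types)

# classes_ is sorted, so the integer codes follow alphabetical order,
# not any meaningful ranking of the categories
print(list(le_demo.classes_))
print(codes.tolist())

# One-hot alternative: one 0/1 column per category, no implied order
one_hot = pd.get_dummies(post_types, dtype=int)
print(one_hot)
```

With label codes, "Comment" and "Reply" end up further apart than "Comment" and "CommentThread" purely because of spelling; with one-hot columns every pair of categories is equally distant.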

# **5️⃣ Text Feature Extraction (TF-IDF)**

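We convert the post text into numerical features using TF-IDF.
TF-IDF weights each word by how often it appears in a post and how rare it is across all posts, so distinctive words score higher than common ones.
We limit the vocabulary to the 5000 most frequent terms and remove common English stop words.
Then, we combine these text features with the two encoded categorical features into a single sparse matrix.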

In [9]:
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X_text = vectorizer.fit_transform(df["Text"].astype(str))
X_features = df[categorical_cols].astype(int).values
X_final = hstack([X_features, X_text])
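
The following sketch repeats the same steps on three made-up posts, to show the shape of the combined matrix. The posts and the category codes are hypothetical; only the calls mirror the cell above.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical posts and already-encoded categorical codes
posts = [
    "I cannot submit the quiz, please help",
    "Thanks for the great lecture",
    "The quiz deadline is confusing",
]
cat_codes = np.array([[0, 1], [2, 0], [0, 1]])

vec = TfidfVectorizer(stop_words="english")
X_text_demo = vec.fit_transform(posts)

# Categorical columns first, TF-IDF columns after; the result stays sparse
X_demo = hstack([cat_codes, X_text_demo]).tocsr()

print(sorted(vec.vocabulary_))
print(X_demo.shape)
```

The number of columns is the two categorical features plus the vocabulary size, and stop words such as "the" never enter the vocabulary.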

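# **6️⃣ Train-Test Split & SMOTE**

We split the dataset into training (80%) and testing (20%) sets.
We then apply SMOTE to the training set only, so the test set keeps the original class distribution.
SMOTE generates synthetic samples for the minority class (high urgency) by interpolating between existing minority samples, which can help the model learn that class.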



In [10]:
y = df["Urgency_binary"]
X_train, X_test, y_train, y_test = train_test_split(X_final, y, test_size=0.2, random_state=42)

smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
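
To give an intuition for what SMOTE does, here is a simplified sketch of its interpolation step in NumPy. It is not the imblearn implementation: the real algorithm picks among several nearest neighbours, while this version uses only the single nearest one, and the 2-D points are made up.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 2-D minority-class points (the notebook's features are high-dimensional)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]])

def synthesize(points, rng):
    """Create one synthetic point between a random sample and its nearest neighbour."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    dists[i] = np.inf              # exclude the sample itself
    j = int(np.argmin(dists))      # index of the nearest other minority sample
    lam = rng.random()             # interpolation weight in [0, 1)
    return points[i] + lam * (points[j] - points[i])

new_point = synthesize(minority, rng)
print(new_point)
```

Because each synthetic point lies on the line segment between two real minority samples, it stays inside the region the minority class already occupies.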

# **7️⃣ Scale Sparse Data**

Scaling features is a crucial step for distance-based algorithms like K-Nearest Neighbors (KNN).
KNN calculates distances between data points, so features with larger scales dominate the distance computation, which can degrade model performance.

Our feature matrix includes TF-IDF vectors, which are sparse (most values are zero). Mean-centering would turn most of those zeros into nonzero values, destroying sparsity and sharply increasing memory usage, so StandardScaler raises an error if asked to center a sparse matrix.

To handle this, we use StandardScaler with with_mean=False. This divides each feature by its standard deviation, giving unit variance without centering and preserving the sparsity of the data. The scaler is fitted on the resampled training set only and then applied to the test set.

In [None]:
scaler = StandardScaler(with_mean=False)  # with_mean=False for sparse data
X_train_scaled = scaler.fit_transform(X_train_resampled)
X_test_scaled = scaler.transform(X_test)
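
A quick way to confirm this behaviour is to scale a tiny sparse matrix and compare the number of stored nonzero entries before and after. The matrix below is made up for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import StandardScaler

# Hypothetical 3x3 sparse matrix with four nonzero entries
X_sparse = csr_matrix(np.array([[0.0, 2.0, 0.0],
                                [4.0, 0.0, 0.0],
                                [0.0, 6.0, 1.0]]))

# Scaling without centering keeps zeros as zeros
scaled = StandardScaler(with_mean=False).fit_transform(X_sparse)
print("nonzeros before:", X_sparse.nnz, "after:", scaled.nnz)
print("column std after scaling:", scaled.toarray().std(axis=0))

# Asking for centering on sparse input is rejected
try:
    StandardScaler(with_mean=True).fit(X_sparse)
    centering_failed = False
except ValueError:
    centering_failed = True
print("centering rejected:", centering_failed)
```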

# **8️⃣ Train KNN with GridSearchCV**

We train a K-Nearest Neighbors classifier and tune two hyperparameters with GridSearchCV (5-fold cross-validation, scored by F1):

* n_neighbors: number of neighbors

* weights: uniform or distance-based

The distance metric is fixed rather than tuned:

* metric='cosine': suitable for high-dimensional text features

In [11]:
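# Cosine distance is fixed; only n_neighbors and weights are searched
knn = KNeighborsClassifier(metric='cosine')
param_grid = {
    "n_neighbors": [3, 5, 7, 9, 11, 15],
    "weights": ["uniform", "distance"]
}
# 5-fold cross-validation on the resampled training set, selecting by F1
grid = GridSearchCV(knn, param_grid, cv=5, scoring="f1", n_jobs=-1)
grid.fit(X_train_scaled, y_train_resampled)

# Predict on the untouched test set with the best configuration
best_knn = grid.best_estimator_
y_pred = best_knn.predict(X_test_scaled)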
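
To illustrate why cosine distance is a common choice for text vectors, the sketch below compares it with Euclidean distance on three made-up term-weight vectors. The helper `cosine_distance` is written here for illustration and follows the usual definition (one minus cosine similarity).

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity: depends on direction only, not on vector length."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical term-weight vectors
short_post = np.array([1.0, 1.0, 0.0])   # mentions two terms once
long_post = np.array([3.0, 3.0, 0.0])    # same terms, three times as often
other_post = np.array([0.0, 0.0, 2.0])   # a different term entirely

print("cosine, same topic:     ", cosine_distance(short_post, long_post))
print("cosine, different topic:", cosine_distance(short_post, other_post))
print("euclidean, same topic:  ", np.linalg.norm(short_post - long_post))
```

The two posts about the same terms have cosine distance 0 even though their Euclidean distance is large, so post length does not push similar posts apart.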



# **9️⃣ KNN Model Evaluation**

Key Points:

The model predicts low urgency (class 0) fairly well (F1-score 0.82).

The model struggles with high urgency (class 1), with low precision (0.38) and recall (0.47), giving an F1-score of 0.42.

Overall accuracy (~72%) is misleading because the dataset is imbalanced: always predicting class 0 would score about 78% on this test set (4645 of 5921 posts).

The evaluation shows that KNN with this setup is not well suited to detecting high-urgency posts.

In [14]:
print("Best Parameters:", grid.best_params_)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Best Parameters: {'n_neighbors': 3, 'weights': 'distance'}
Accuracy: 0.7218375274446884
Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.79      0.82      4645
           1       0.38      0.47      0.42      1276

    accuracy                           0.72      5921
   macro avg       0.61      0.63      0.62      5921
weighted avg       0.75      0.72      0.73      5921
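
As a sanity check, the F1-scores and the majority-class baseline can be recomputed from the numbers in the report above. This sketch uses the rounded precision and recall values, so it can differ slightly from what scikit-learn computes from unrounded values.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as printed in the classification report
f1_low = f1(0.85, 0.79)    # class 0
f1_high = f1(0.38, 0.47)   # class 1

# Accuracy of a model that always predicts class 0 (support 4645 out of 5921)
majority_baseline = 4645 / 5921

print("F1 class 0:", round(f1_low, 2))
print("F1 class 1:", round(f1_high, 2))
print("Majority-class accuracy:", round(majority_baseline, 3))
```

The baseline shows why accuracy alone is a weak measure here: a model that never flags a post as urgent would still score higher on accuracy than the tuned KNN.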

