# **Data preprocessing:**

In [1]:
#Read train.csv
import pandas as pd

data = pd.read_csv('train.csv')
Raw_X_train = data['Page content']
y_train = data['Popularity']

test = pd.read_csv('test.csv')
Raw_X_test = test['Page content']
X_test_id = test['Id']

print(Raw_X_train.head())
print(y_train.head())


0    <html><head><div class="article-info"> <span c...
1    <html><head><div class="article-info"><span cl...
2    <html><head><div class="article-info"><span cl...
3    <html><head><div class="article-info"><span cl...
4    <html><head><div class="article-info"><span cl...
Name: Page content, dtype: object
0   -1
1    1
2    1
3   -1
4   -1
Name: Popularity, dtype: int64


In [2]:
#Process HTML tags
from bs4 import BeautifulSoup

Untag_X_train = Raw_X_train.apply(lambda x: BeautifulSoup(x, 'html.parser').get_text())

Untag_X_test = Raw_X_test.apply(lambda x: BeautifulSoup(x, 'html.parser').get_text())

print(Untag_X_train.head())


0     Clara Moskowitz for Space.com 2013-06-19 15:0...
1    By Christina Warren2013-03-28 17:40:55 UTCGoog...
2    By Sam Laird2014-05-07 19:15:20 UTCBallin': 20...
3    By Sam Laird2013-10-11 02:26:50 UTCCameraperso...
4    By Connor Finnegan2014-04-17 03:31:43 UTCNFL S...
Name: Page content, dtype: object


# **Feature extraction:**

In [3]:
import pandas as pd
import re

# Create lists to store extracted timestamps and topics
X_train_timestamps = []
X_test_timestamps = []
X_train_topics = []
X_test_topics = []

# Timestamp extraction pattern including multiple possible formats
timestamp_pattern = r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} (?:UTC|[-+]\d{4})|' + \
                    r'\d{4}/\d{1,2}/\d{1,2} \d{1,2}:\d{2}:\d{2} (?:AM|PM)|' + \
                    r'\d{4}-\d{1,2}-\d{1,2} \d{1,2}:\d{2}:\d{2} (?:UTC|[-+]\d{4})|' + \
                    r'\d{4}/\d{1,2}/\d{1,2} \d{1,2}:\d{2}:\d{2} (?:UTC|[-+]\d{4}))'

# Topic extraction pattern
topic_pattern = r'Topics:\s*(.*)$'

# Normalize timestamps to the target format 'YYYY-MM-DD HH:MM:SS UTC'
def normalize_timestamp(timestamp):
    if not timestamp:
        return None
    # Handle formats that include AM/PM (no time zone is given, so UTC is assumed)
    if 'AM' in timestamp or 'PM' in timestamp:
        dt = pd.to_datetime(timestamp, format='%Y/%m/%d %I:%M:%S %p', errors='coerce')
    else:
        # Rewrite slash-separated dates with dashes; utc=True converts both
        # 'UTC' and numeric offsets such as '+0000' to UTC
        dt = pd.to_datetime(timestamp.replace('/', '-'), errors='coerce', utc=True)
    # errors='coerce' yields NaT for unparseable input; return None instead of formatting NaT
    if pd.isna(dt):
        return None
    if dt.tzinfo is None:
        dt = dt.tz_localize('UTC')
    return dt.strftime('%Y-%m-%d %H:%M:%S %Z')

# Extract and normalize the timestamp and topics of every document in one split,
# appending the results to the given lists
def extract_features(contents, timestamps_out, topics_out):
    for content in contents:
        # Extract timestamp
        timestamp_match = re.search(timestamp_pattern, content)
        timestamp = timestamp_match.group(0) if timestamp_match else None
        timestamps_out.append(normalize_timestamp(timestamp))

        # Extract topics
        topic_match = re.search(topic_pattern, content)
        topics_out.append(topic_match.group(1) if topic_match else None)

# Apply the same extraction to the training and test data
extract_features(Untag_X_train, X_train_timestamps, X_train_topics)
extract_features(Untag_X_test, X_test_timestamps, X_test_topics)

# Create DataFrames to store the extracted timestamps and topics
extracted_X_train = pd.DataFrame({
    'Timestamp': X_train_timestamps,
    'Topics': X_train_topics
})

extracted_X_test = pd.DataFrame({
    'Timestamp': X_test_timestamps,
    'Topics': X_test_topics
})

# Output the shape of the DataFrames
print(extracted_X_train.shape)
print(extracted_X_test.shape)

# Print some samples for debugging
print("Sample extracted training timestamps and topics:")
print(extracted_X_train.head())
print("Sample extracted test timestamps and topics:")
print(extracted_X_test.head())

# Check the number of missing values in the training data
missing_train = extracted_X_train.isnull().sum()
print("Missing values in training data:")
print(missing_train)

# Check the number of missing values in the test data
missing_test = extracted_X_test.isnull().sum()
print("Missing values in test data:")
print(missing_test)


(27643, 2)
(11847, 2)
Sample extracted training timestamps and topics:
                 Timestamp                                             Topics
0  2013-06-19 15:04:30 UTC  Asteroid, Asteroids, challenge, Earth, Space, ...
1  2013-03-28 17:40:55 UTC  Apps and Software, Google, open source, opn pl...
2  2014-05-07 19:15:20 UTC  Entertainment, NFL, NFL Draft, Sports, Televis...
3  2013-10-11 02:26:50 UTC                Sports, Video, Videos, Watercooler 
4  2014-04-17 03:31:43 UTC  Entertainment, instagram, instagram video, NFL...
Sample extracted test timestamps and topics:
                 Timestamp                                             Topics
0  2013-09-09 19:47:02 UTC  Entertainment, Music, One Direction, soccer, S...
1  2013-10-31 09:25:02 UTC  Gadgets, glass, Google, Google Glass, Google G...
2  2013-06-25 12:54:54 UTC           amazon, amazon kindle, Business, Gaming 
3  2013-02-13 03:30:21 UTC  Between Two Ferns, Movies, The Oscars, Oscars ...
4  2014-10-03 01:34:54 UTC

In [4]:
import pandas as pd
import re

# Assuming extracted_X_train and extracted_X_test already exist, and the 'Timestamp' column has been correctly extracted

# Combine the timestamp data from the training and test sets
combined_timestamps = pd.concat([
    extracted_X_train['Timestamp'],
    extracted_X_test['Timestamp']
], axis=0)

# Remove leading and trailing whitespace from timestamps
combined_timestamps = combined_timestamps.str.strip()

# Use a regular expression to extract only UTC formatted timestamps
def extract_utc_timestamps(timestamps):
    timestamp_pattern = r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} UTC|\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4})'
    return timestamps.str.extract(timestamp_pattern)[0]

# Extract UTC timestamps
extracted_timestamps = extract_utc_timestamps(combined_timestamps)

# Ensure the 'Timestamp' column is in datetime format and handle UTC formatting
def parse_utc_timestamps(timestamps):
    parsed = pd.to_datetime(timestamps, errors='coerce', utc=True)
    return parsed

# Apply the parsing function
combined_timestamps_parsed = parse_utc_timestamps(extracted_timestamps)

# Extract time features
combined_time_features = pd.DataFrame()
combined_time_features['Year'] = combined_timestamps_parsed.dt.year
combined_time_features['Month'] = combined_timestamps_parsed.dt.month
combined_time_features['Day'] = combined_timestamps_parsed.dt.day
combined_time_features['Hour'] = combined_timestamps_parsed.dt.hour
combined_time_features['DayOfWeek'] = combined_timestamps_parsed.dt.dayofweek

# Check for missing values
print("Combined Time Features Missing Values:\n", combined_time_features.isnull().sum())

# Handle missing values - filling with 0 as an example
combined_time_features.fillna(0, inplace=True)

# Split the extracted features back into training and test sets
train_time_features = combined_time_features.iloc[:len(extracted_X_train)]
test_time_features = combined_time_features.iloc[len(extracted_X_train):]

# Output the processed DataFrames
print("Train Time Features (Numerical):")
print(train_time_features.head())
print("Test Time Features (Numerical):")
print(test_time_features.head())


Combined Time Features Missing Values:
 Year         1
Month        1
Day          1
Hour         1
DayOfWeek    1
dtype: int64
Train Time Features (Numerical):
     Year  Month   Day  Hour  DayOfWeek
0  2013.0    6.0  19.0  15.0        2.0
1  2013.0    3.0  28.0  17.0        3.0
2  2014.0    5.0   7.0  19.0        2.0
3  2013.0   10.0  11.0   2.0        4.0
4  2014.0    4.0  17.0   3.0        3.0
Test Time Features (Numerical):
     Year  Month   Day  Hour  DayOfWeek
0  2013.0    9.0   9.0  19.0        0.0
1  2013.0   10.0  31.0   9.0        3.0
2  2013.0    6.0  25.0  12.0        1.0
3  2013.0    2.0  13.0   3.0        2.0
4  2014.0   10.0   3.0   1.0        4.0


In [5]:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Assuming extracted_X_train and extracted_X_test already exist

# Combine the 'Topics' data from the training and test sets
combined_topics = pd.concat([extracted_X_train['Topics'], extracted_X_test['Topics']], axis=0)

# Clean and standardize the 'Topics' data
combined_topics = combined_topics.str.strip().str.replace(r'\s+', ' ', regex=True)

# Split topics (assuming topics are separated by commas) and remove empty topics;
# rows with no extracted topics (None/NaN) become an empty list
combined_topics = combined_topics.str.split(', ').apply(
    lambda x: [topic for topic in x if topic] if isinstance(x, list) else []
)

# Use MultiLabelBinarizer for One-Hot encoding
mlb = MultiLabelBinarizer()
combined_topics_one_hot = mlb.fit_transform(combined_topics)

# Split the One-Hot encoded results back into training and test sets
train_topics_one_hot = pd.DataFrame(combined_topics_one_hot[:len(extracted_X_train)],
                                     columns=mlb.classes_)
test_topics_one_hot = pd.DataFrame(combined_topics_one_hot[len(extracted_X_train):],
                                    columns=mlb.classes_)

# Output the shape of the processed DataFrames
print("Train Topics One-Hot Columns:")
print(train_topics_one_hot.shape)
print("Test Topics One-Hot Columns:")
print(test_topics_one_hot.shape)


Train Topics One-Hot Columns:
(27643, 16951)
Test Topics One-Hot Columns:
(11847, 16951)


In [14]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2

# Assumes train_time_features, train_topics_one_hot, and y_train
# (the 'Popularity' labels read from train.csv) already exist

# Combine other features (time features and One-Hot encoded topics)
train_notext = pd.concat([
    train_time_features.reset_index(drop=True),
    train_topics_one_hot.reset_index(drop=True),  # Includes One-Hot encoded topics
], axis=1)

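# Split the dataset into training and validation sets.
# Note: test_size=0.0001 holds out only 3 of the 27,643 rows, so the validation
# AUC computed later is not a reliable estimate of generalization.
X_train_notext, X_val_notext, y_train_notext, y_val_notext = train_test_split(
    train_notext, y_train, test_size=0.0001, random_state=18
)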

print(X_train_notext.shape)
print(X_val_notext.shape)

# Extract column names of topic features (assumed to be in train_topics_one_hot)
topic_columns = train_topics_one_hot.columns

# Perform Chi-Square test on topic features in the training set
chi2_values, p_values = chi2(X_train_notext[topic_columns], y_train_notext)

# Store results in a DataFrame
chi2_results = pd.DataFrame({
    'Feature': topic_columns,
    'Chi2 Value': chi2_values,
    'P-Value': p_values
})

# Filter out significant features (e.g., P-value < 0.1)
significant_features = chi2_results[chi2_results['P-Value'] < 0.1]['Feature']

# Rebuild the training and validation sets, keeping time features and significant topic features
X_train_notext = pd.concat([X_train_notext.drop(columns=topic_columns).reset_index(drop=True),
                            X_train_notext[significant_features].reset_index(drop=True)],
                           axis=1)

X_val_notext = pd.concat([X_val_notext.drop(columns=topic_columns).reset_index(drop=True),
                          X_val_notext[significant_features].reset_index(drop=True)],
                         axis=1)

# Print final shapes of training and validation sets with significant features
print("Final X_train shape with significant features:")
print(X_train_notext.shape)
print("Final X_val shape with significant features:")
print(X_val_notext.shape)

# If you need to inspect the merged feature columns
print("X_train_final columns:")
print(X_train_notext.columns)

# Build the test DataFrame the same way (time features plus One-Hot encoded topics)
test_notext = pd.concat([
    test_time_features.reset_index(drop=True),
    test_topics_one_hot.reset_index(drop=True),  # Includes One-Hot encoded topics
], axis=1)

# Apply the same selection of significant features to the test set
test_notext = pd.concat([test_notext.drop(columns=topic_columns).reset_index(drop=True),
                         test_notext[significant_features].reset_index(drop=True)],
                        axis=1)

# Print final shape of the test set with significant features
print("Final X_test shape with significant features:")
print(test_notext.shape)


(27640, 16956)
(3, 16956)
Final X_train shape with significant features:
(27640, 703)
Final X_val shape with significant features:
(3, 703)
X_train_final columns:
Index(['Year', 'Month', 'Day', 'Hour', 'DayOfWeek', '10th anniversary',
       '2014 election', '20th century fox', '30 Days of Buzzwords',
       '4th amendment',
       ...
       'winamp', 'women', 'word', 'working from home', 'world series',
       'wrecking ball', 'wristwatches', 'xcom the bureau', 'year in review',
       'yec'],
      dtype='object', length=703)
Final X_test shape with significant features:
(11847, 703)


# **Building model and training:**

In [16]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Create a Random Forest classifier model
# Try different hyperparameter combinations
rf_model = RandomForestClassifier(n_estimators=100, max_depth=30, random_state=23)

# Train the model
rf_model.fit(X_train_notext, y_train_notext)
y_train_proba = rf_model.predict_proba(X_train_notext)[:, 1]  # Select probabilities for the positive class

# Calculate AUC for the training set
train_auc_score = roc_auc_score(y_train_notext, y_train_proba)

# Output the training AUC result
print(f"Training AUC: {train_auc_score:.4f}")

# Make probability predictions on the validation set
y_val_proba = rf_model.predict_proba(X_val_notext)[:, 1]  # Select probabilities for the positive class

# Calculate AUC for the validation set
auc_score = roc_auc_score(y_val_notext, y_val_proba)

# Output the validation AUC result
print(f"Validation AUC: {auc_score:.4f}")

# Make probability predictions on the test set
y_test_proba = rf_model.predict_proba(test_notext)[:, 1]

# Create a DataFrame containing the test set predictions
test_predictions_rf = pd.DataFrame({
    'Id': X_test_id,  # Assuming the test set contains an 'Id' column
    'Predicted_Popularity': y_test_proba
})

# Save the predictions to a CSV file
test_predictions_rf.to_csv('test_predictions_rf.csv', index=False)

print("Test predictions saved to 'test_predictions_rf.csv'.")


Training AUC: 0.8676
Validation AUC: 1.0000
Test predictions saved to 'test_predictions_rf.csv'.


# **Report:**

Student ID & name of each member:
113062624 張瑋倫

**How did you preprocess the data:**

First, I stripped the HTML tags from each page with BeautifulSoup, which leaves cleaner plain-text data.

With the tags removed, a timestamp is visible near the beginning of each text, and the topic list appears after 'Topics:' at the end. I extracted these two parts as separate features.
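The extraction step can be illustrated on a single string. This is a minimal sketch: `sample_text` is a made-up example that only mimics the layout of the untagged articles, and the timestamp pattern covers just the most common `UTC` format rather than every variant handled in the notebook.

```python
import re

# Hypothetical sample mimicking the untagged article text (not taken from the dataset)
sample_text = (
    "By Jane Doe2013-03-28 17:40:55 UTCSome headline and body text. "
    "Topics: Apps and Software, Google, open source"
)

# Timestamp near the start of the text: "YYYY-MM-DD HH:MM:SS UTC"
timestamp_pattern = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} UTC"
# Everything after "Topics:" up to the end of the text
topic_pattern = r"Topics:\s*(.*)$"

timestamp_match = re.search(timestamp_pattern, sample_text)
topic_match = re.search(topic_pattern, sample_text)

# Fall back to None / an empty list when a pattern is not found
timestamp = timestamp_match.group(0) if timestamp_match else None
topics = topic_match.group(1).split(", ") if topic_match else []

print(timestamp)
print(topics)
```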

I split each timestamp into year, month, day, hour, and day of the week, and passed these numerical values directly to the model as features.
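As a small illustration of this decomposition, the sketch below parses two timestamps in the normalized format and derives the five numerical features. The explicit `format` string with a literal `UTC` suffix is an assumption made here for strictness; the notebook relies on pandas' automatic parsing instead.

```python
import pandas as pd

# Two timestamps in the normalized "YYYY-MM-DD HH:MM:SS UTC" format
timestamps = pd.Series(["2013-06-19 15:04:30 UTC", "2014-05-07 19:15:20 UTC"])

# Match the literal " UTC" suffix and mark the parsed values as UTC
parsed = pd.to_datetime(timestamps, format="%Y-%m-%d %H:%M:%S UTC", utc=True)

time_features = pd.DataFrame({
    "Year": parsed.dt.year,
    "Month": parsed.dt.month,
    "Day": parsed.dt.day,
    "Hour": parsed.dt.hour,
    "DayOfWeek": parsed.dt.dayofweek,  # Monday=0 ... Sunday=6
})

print(time_features)
```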

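For the topics, I first one-hot encoded all unique topics, then used a chi-square test to reduce the dimensionality of the encoded data, keeping only the topics significantly associated with popularity (p-value < 0.1). This reduced the 16,951 topic columns to 698. Together with the five time features, these form the data used for training.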
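The two steps, one-hot encoding followed by chi-square filtering, can be sketched on toy data. The topic lists and labels below are invented for illustration; only the 0.1 p-value threshold is taken from the notebook.

```python
import pandas as pd
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data (hypothetical): one topic list and one popularity label per article
topic_lists = [
    ["Sports", "NFL"],
    ["Sports", "Video"],
    ["Google", "Apps"],
    ["Google", "Video"],
    ["Sports", "NFL"],
    ["Google", "Apps"],
]
labels = [1, 1, -1, -1, 1, -1]

# One-hot encode every unique topic
mlb = MultiLabelBinarizer()
one_hot = pd.DataFrame(mlb.fit_transform(topic_lists), columns=mlb.classes_)

# Chi-square test of each topic column against the labels
chi2_values, p_values = chi2(one_hot, labels)
results = pd.DataFrame({
    "Feature": one_hot.columns,
    "Chi2 Value": chi2_values,
    "P-Value": p_values,
})

# Keep only the topics whose p-value is below the threshold
selected = results.loc[results["P-Value"] < 0.1, "Feature"].tolist()
reduced = one_hot[selected]

print(results)
print(selected)
```

In this toy example only the topics that appear exclusively in one class often enough pass the threshold; a topic split evenly between the classes gets a chi-square value of 0 and is dropped.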

**How did you build the classifier:**

I built the model directly with scikit-learn's RandomForestClassifier. With n_estimators=100, values of max_depth between 20 and 30 gave nearly identical results. I used no special techniques in the model itself, because repeated comparisons showed that the random forest's performance was sufficient.
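The exact comparison procedure is not shown in the notebook. One possible way to compare max_depth settings is k-fold cross-validation scored by AUC, sketched below on synthetic data from `make_classification`; the data, the 5-fold setup, and the two depths tried are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real feature matrix (hypothetical data)
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

cv_auc = {}
for max_depth in (20, 30):
    model = RandomForestClassifier(
        n_estimators=100, max_depth=max_depth, random_state=23
    )
    # 5-fold cross-validated AUC for this depth
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    cv_auc[max_depth] = scores.mean()
    print(f"max_depth={max_depth}: mean CV AUC = {scores.mean():.4f}")
```

Averaging over several folds gives a steadier estimate than a single small hold-out set, which makes small differences between settings easier to judge.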

**Conclusion:**

In this assignment, I came to fully appreciate how important feature selection is for model training. At first I focused on processing the words themselves, especially word embeddings. After examining the text in detail, I found that several features could be extracted separately: author, time, image technique, and theme content. After trying various combinations of features, I found that time is most closely related to popularity, followed by topic. In contrast, the article body is full of noise and did not help the overall training.

After extracting the time and topic features, how they are processed also matters. At first I one-hot encoded the time features. That forces every time point into its own category, but it cannot capture the continuity of time. Treating the time features as numerical values gave better results.
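The loss of continuity under one-hot encoding can be seen in a tiny example. This sketch uses three made-up hour values and compares how far apart they are under each encoding.

```python
import pandas as pd

# Three hypothetical publication hours: two adjacent, one far away
hours = pd.Series([2, 3, 19], name="Hour")

# Numerical encoding: the gap between values reflects the gap in time
numeric_gap_near = abs(hours[0] - hours[1])
numeric_gap_far = abs(hours[0] - hours[2])

# One-hot encoding: every pair of distinct hours differs in exactly two columns
one_hot = pd.get_dummies(hours, prefix="Hour").astype(int)
one_hot_gap_near = (one_hot.iloc[0] - one_hot.iloc[1]).abs().sum()
one_hot_gap_far = (one_hot.iloc[0] - one_hot.iloc[2]).abs().sum()

print("numeric:", numeric_gap_near, numeric_gap_far)
print("one-hot:", one_hot_gap_near, one_hot_gap_far)
```

With numerical values, hour 2 is closer to hour 3 than to hour 19; with one-hot columns, both pairs look equally different, so the ordering of time is lost.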

For the topics, the chi-square test identifies the topics most relevant to popularity so that only those are passed to the model. Its primary purpose is to reduce noise and avoid overfitting. One thing to watch for here is information leakage, which happens when the chi-square test is performed on the validation set and the training set together. I did not notice this at first, and the validation AUC reached 0.88; I later found that this figure was completely wrong.
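A leakage-free version of the selection step can be sketched as follows: split first, compute the chi-square statistics on the training split only, then apply the resulting column mask to both splits. The random binary matrix, the labels, and the 20% validation fraction are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=400)            # hypothetical popularity labels
X = rng.integers(0, 2, size=(400, 30))       # hypothetical binary topic indicators
X[:, 0] = (y == 1).astype(int)               # make column 0 clearly informative

# Split BEFORE any label-dependent feature selection
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=18, stratify=y
)

# Chi-square statistics are computed from the training split only
_, p_values = chi2(X_tr, y_tr)
keep = p_values < 0.1

# The mask learned on the training data is applied unchanged to both splits
X_tr_sel = X_tr[:, keep]
X_val_sel = X_val[:, keep]

print(X_tr_sel.shape, X_val_sel.shape)
```

Because the validation labels never influence which columns are kept, the validation score is not inflated by the selection step.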

What I learned most from this assignment was how to observe and select features. Sometimes the devil is in the details. It takes experience and intuition to work out the relationship between features and labels, and there are many misleading factors along the way. Only by constantly trying can the results keep improving.

