 Part 1 (30 points) - a Jupyter notebook containing
– a description (in words, no code) of the steps you followed to arrive at your predictions
and your estimates of prediction quality - including a description of any separation of
your training data into training and testing data, method you used for imputation,
methods you tried to use for making predictions (e.g. regression, logistic regression, ...)
followed by
– the code you used in your calculation

**Description**

**Data cleaning and analyzing**

Once the datasets were loaded into pandas DataFrames, I computed the median of each of the 40 numeric feature columns in the training data and used these median values to fill any missing entries in those columns in both the training and test datasets. Using the training medians for both sets kept the handling of missing data consistent between the training and testing phases and avoided dropping incomplete rows. I confirmed that the cleaning worked by verifying that no missing values remained in the numeric columns of either dataset.
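
The imputation step described above can be sketched on a toy example. The frames `toy_train` and `toy_test` and their values are hypothetical stand-ins, not the assignment data; the point is only that the medians come from the training frame and are reused for the test frame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for the real training and test data.
toy_train = pd.DataFrame({
    "educat": [1.0, np.nan, 3.0, 5.0],
    "histor": [2.0, 4.0, np.nan, 6.0],
})
toy_test = pd.DataFrame({
    "educat": [np.nan, 2.0],
    "histor": [1.0, np.nan],
})

# Medians are computed on the training data only (NaN entries are skipped).
train_medians = toy_train.median()

# The same training medians fill the gaps in both frames.
toy_train_filled = toy_train.fillna(train_medians)
toy_test_filled = toy_test.fillna(train_medians)

print(train_medians)
print(toy_test_filled)
```

Reusing the training medians for the test set means the test data never influences the values used for filling.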

**Method**

**P1**

For Problem 1, my approach to predict the page length of Wikipedia articles involved comparing three different regression models: XGBoost, RandomForest, and Linear Regression.

**XGBoost Regression:** I began with an XGBoost regressor (100 trees, learning rate 0.1, maximum depth 5). The model was trained on 80% of the training data and validated on the remaining 20%, and I also ran 5-fold cross-validation. The cross-validated MAE was about 8,549 on average with little variation between folds, which gave insight into the model's consistency and predictive performance.

**RandomForest Regression:** In parallel with XGBoost, I used a RandomForestRegressor with 100 trees to predict page lengths, assessed in the same way through a validation split and 5-fold cross-validation. This model produced a clearly lower MAE (a cross-validated mean of about 6,131), suggesting more accurate predictions on average than XGBoost.

**Linear Regression:** As a baseline, I also fitted a Linear Regression model to gauge how much the more complex models gained over a basic linear fit. It was evaluated with the same 5-fold cross-validation and had the highest MAE of the three (a mean of about 10,874), so the random forest remained the best model.

For each model, predictions were made on a separate test dataset, and results were compiled into a DataFrame, highlighting the predicted lengths along with the corresponding URL IDs. The predictions were then saved to a CSV file for further evaluation or operational use. This multi-model analysis framework enabled me to not only select the model that best meets our accuracy and robustness criteria but also to provide a clear comparative perspective across different regression methodologies.
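
The multi-model comparison described above could be written as a single loop. This is a minimal sketch on synthetic data, not the notebook's actual run: `make_regression` stands in for the real features, XGBoost is left out to keep the example to scikit-learn, and the names `candidates`, `cv_mae`, and `best_name` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the 40 word-count features and the page length.
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

# Candidate models, keyed by a label used in the results.
candidates = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# 5-fold cross-validated MAE for each candidate (scores come back negated).
cv_mae = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5,
                             scoring="neg_mean_absolute_error")
    cv_mae[name] = -scores.mean()

# The candidate with the lowest mean MAE is selected.
best_name = min(cv_mae, key=cv_mae.get)
print(cv_mae, best_name)
```

Because the synthetic target is linear by construction, the ranking here need not match the ranking observed on the Wikipedia data.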

**P2**

**Data Preparation:** I prepared the dataset by selecting the features and target variables. For predicting the presence of specific words and whether a page was edited in 2023, I used the first 40 feature columns of the dataset, excluding the non-numeric identifier 'URLID'. As binary targets I used the 'word_present' column provided in the training data and created 'edited_2023' from the editing date.

**Model Selection and Training:** I used the classification counterparts of the models in P1 (XGBoost, RandomForest, and Logistic Regression in place of Linear Regression) to keep the comparison consistent. Each model was trained on 80% of the training data, while the remaining 20% was set aside for validation.

**Validation and Model Tuning:** After initial training, I validated each model on the held-out validation set. For each model I used the ROC curve to choose the decision threshold whose False Positive Rate (FPR) was closest to the target of 0.05, and compared models by the True Positive Rate (TPR) achieved at that threshold. Additionally, I used GridSearchCV to tune the hyperparameters of the RandomForest model, aiming to improve its recall.
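
Choosing the point closest to the target can land slightly above 0.05. The sketch below shows a variation that never exceeds the target: it takes the highest-TPR point on the ROC curve whose FPR is at or below the limit. The helper `threshold_at_fpr` and the synthetic scores are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.metrics import roc_curve


def threshold_at_fpr(y_true, scores, max_fpr=0.05):
    """Return (threshold, fpr, tpr) for the best ROC point with fpr <= max_fpr."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    # fpr and tpr are non-decreasing, and fpr[0] is always 0, so the last
    # index with fpr <= max_fpr exists and has the highest admissible TPR.
    idx = np.where(fpr <= max_fpr)[0][-1]
    return thresholds[idx], fpr[idx], tpr[idx]


# Synthetic labels and scores: positives tend to score higher than negatives.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=1000)
scores_demo = 0.3 * y_demo + 0.7 * rng.random(1000)

threshold, fpr_at_thr, tpr_at_thr = threshold_at_fpr(y_demo, scores_demo, max_fpr=0.05)
print(threshold, fpr_at_thr, tpr_at_thr)
```

Predictions are then made with `scores >= threshold`, which is the same rule the notebook applies to the test probabilities.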

**Performance Evaluation:** I assessed the quality of the predictions using Mean Absolute Error (MAE) for numerical predictions and confusion matrix metrics for classification tasks. 

**Cross-Validation:** To ensure that the model's performance was not a result of overfitting to the validation set, I implemented 5-fold cross-validation. This technique provided a more reliable estimate of model accuracy across different subsets of data.
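
For a classifier, one way to apply 5-fold cross-validation is to collect out-of-fold probabilities, so that every row is scored by a model that never saw it. This is a sketch on synthetic data; `make_classification`, the logistic regression model, and the names `oof_proba` and `oof_auc` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic binary classification data standing in for the real features.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Stratified folds keep the class balance similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each row's probability comes from the model trained on the other four folds.
oof_proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X_demo, y_demo,
    cv=cv, method="predict_proba",
)[:, 1]

# Any threshold-free or thresholded metric can be computed on these scores.
oof_auc = roc_auc_score(y_demo, oof_proba)
print(oof_auc)
```

The out-of-fold probabilities can also be passed to `roc_curve` to estimate the TPR at a fixed FPR using all of the training rows rather than a single 20% split.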

**Threshold Application and Final Predictions:** Using the best-performing models, I applied the determined optimal thresholds to the test dataset to classify each page accordingly. The final predictions were compiled into a DataFrame, clearly linking the predictions with the respective 'URLID' of each page.

**P3**

**Data Preparation:**
First, I converted the 'date' column of the dataset into a binary target variable 'edited_2023'. This involved parsing the dates to extract the year and checking if it was 2023, which was then encoded as 1 (edited in 2023) or 0 (not edited in 2023). 
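
The conversion described above can be illustrated on a few made-up dates. The frame `toy` is hypothetical; the column names follow the ones used in the notebook.

```python
import pandas as pd

# Hypothetical dates in the same string format as the 'date' column.
toy = pd.DataFrame({"date": ["2023-03-21", "2022-12-06", "2023-01-16", "2022-12-30"]})

# Parse the dates, compare the year with 2023, and encode True/False as 1/0.
toy["edited_2023"] = (pd.to_datetime(toy["date"]).dt.year == 2023).astype(int)

print(toy)
```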

**Feature Selection and Dataset Splitting:**
I utilized the first 40 columns of the dataset, assuming these included relevant features that could influence the editing status of a page. The dataset was then split into training and validation sets, using an 80/20 split to ensure sufficient data for training while still allowing for robust validation of model performance.

**Model Selection and Training:**
I implemented and compared several models including RandomForest, XGBoost, and Logistic Regression. Each model was trained on the training data, with parameters tuned to optimize for recall, given the importance of identifying as many true positives as possible without excessively sacrificing precision.

**Optimization and Validation:**
For each model, after training, I predicted the probability that each page in the validation set was edited in 2023. I then computed the ROC curve and chose the threshold whose FPR was closest to the 0.05 target, aiming to limit false alerts while keeping the TPR as high as possible.

**Hyperparameter Tuning:**
Using GridSearchCV, I conducted hyperparameter tuning for the RandomForest model to find the best parameters that improve recall. This included adjusting the number of trees, tree depth, and class weights, focusing on enhancing the model's ability to detect the minority class effectively.
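
A recall-oriented grid search of this kind might look like the sketch below. It uses synthetic imbalanced data and a much smaller grid than the notebook, and it fits the search on the training portion only so that the validation rows stay unseen; all names and grid values here are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data with a minority positive class (about 30%).
X_demo, y_demo = make_classification(n_samples=400, n_features=10,
                                     weights=[0.7, 0.3], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=0, stratify=y_demo)

# Small illustrative grid over tree count, depth, and class weights.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [5, None],
    "class_weight": [{0: 1, 1: 3}, {0: 1, 1: 5}],
}

# scoring="recall" selects the parameters with the best cross-validated recall.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=3, scoring="recall")
search.fit(X_tr, y_tr)  # the validation rows are not used during the search

# Recall of the refitted best model on the held-out validation rows.
val_recall = recall_score(y_va, search.best_estimator_.predict(X_va))
print(search.best_params_, val_recall)
```

Optimizing recall alone tends to favor heavier weights on the positive class, so the FPR at the chosen threshold still needs to be checked separately.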

**Application to Test Data:**
With the optimized models and chosen thresholds, I applied my predictions to the test dataset to classify whether each page was edited in 2023. The results were compiled into a DataFrame, linking each prediction with its corresponding 'URLID' for easy reference and analysis.


**Getting and storing the pickled datasets**

Your two datasets are stored online as pickled pandas data frames. 

These are binary files and you have been given the urls for these files. 

The following code can be used to download one of these binary files and write it to a file on your own computer.

To use this code, you need to substitute **your own url** and a file name you want to use to store the file on your computer.

You will need to do this for each of the two files you have been provided with.

In [1]:
import requests as req

filename="688training_data.pkl"
url="http://jesse.ams.jhu.edu/~dan/FinalAssignmentSpring2024/Train_0433.pkl"

res=req.get(url)
print(res)
with open(filename, 'wb') as fout:
    for chunk in res.iter_content(chunk_size=1024):
        if chunk:
            fout.write(chunk)

<Response [200]>


In [2]:
import requests as req

filename="688test_data.pkl"
url="http://jesse.ams.jhu.edu/~dan/FinalAssignmentSpring2024/TestPredictors_0433.pkl"

res=req.get(url)
print(res)
with open(filename, 'wb') as fout:
    for chunk in res.iter_content(chunk_size=1024):
        if chunk:
            fout.write(chunk)

<Response [200]>


**Reading the pickled dataset into pandas**

Once you have stored the pickled dataset using some file name, you can read it into pandas as a data frame using the following code.

In [3]:
import pandas as pd
training_data="688training_data.pkl"
train=pd.read_pickle(training_data)
train

Unnamed: 0,URLID,educat,histor,biol,human,wateranimal,math,social,rest,sleep,...,museum,child,biograph,soul,difficult,highest,skin,length,date,word_present
98709,URLID_YFVL,1.0,7.0,0.0,0.0,0.0,0.0,1.0,5.0,0.0,...,0.0,27.0,0.0,0.0,0.0,0.0,7.0,44411,2023-03-21,1
234748,URLID_NVNF,1.0,3.0,0.0,0.0,0.0,0.0,1.0,3.0,0.0,...,0.0,26.0,0.0,0.0,,0.0,8.0,49452,2023-03-11,0
88342,URLID_XXPE,0.0,4.0,0.0,0.0,0.0,1.0,1.0,2.0,0.0,...,0.0,25.0,0.0,0.0,0.0,,8.0,37996,2023-01-30,1
61061,URLID_ILOB,0.0,5.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0,...,1.0,24.0,0.0,0.0,0.0,0.0,0.0,31464,2023-01-16,0
17148,URLID_QCFD,4.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,26.0,2.0,0.0,0.0,1.0,7.0,27993,2023-03-15,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
160845,URLID_NTBH,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,25.0,0.0,0.0,0.0,0.0,7.0,14787,2022-12-06,0
39094,URLID_WARN,4.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,...,,25.0,0.0,0.0,0.0,0.0,8.0,39786,2022-12-30,1
7934,URLID_JWHH,0.0,2.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,...,0.0,26.0,0.0,0.0,0.0,0.0,8.0,41782,2023-02-01,0
7594,URLID_VJQB,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,24.0,0.0,0.0,0.0,0.0,0.0,16800,2023-02-21,0


In [4]:
numeric_cols = train.columns[1:41]  
median_values = train[numeric_cols].median()

train[numeric_cols] = train[numeric_cols].fillna(median_values)

print(train[numeric_cols].isnull().sum())


educat         0
histor         0
biol           0
human          0
wateranimal    0
math           0
social         0
rest           0
sleep          0
speech         0
ghost          0
data           0
process        0
business       0
customer       0
depth          0
negative       0
quantitat      0
first          0
party          0
optim          0
private        0
portal         0
game           0
conduct        0
experienc      0
visual         0
audio          0
doctor         0
engage         0
farm           0
decision       0
petrol         0
museum         0
child          0
biograph       0
soul           0
difficult      0
highest        0
skin           0
dtype: int64


In [5]:
import pandas as pd
test_data="688test_data.pkl"
test=pd.read_pickle(test_data)
test

Unnamed: 0,URLID,educat,histor,biol,human,wateranimal,math,social,rest,sleep,...,farm,decision,petrol,museum,child,biograph,soul,difficult,highest,skin
156481,URLID_UMIC,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,...,0.0,0.0,0.0,0.0,25.0,0.0,0.0,0.0,0.0,7.0
249417,URLID_NEDA,1.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,7.0
44014,URLID_SYMC,0.0,1.0,1.0,0.0,0.0,0.0,2.0,0.0,0.0,...,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0
121528,URLID_ZFXI,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,25.0,0.0,0.0,0.0,0.0,7.0
249878,URLID_VMSJ,21.0,,2.0,0.0,0.0,3.0,19.0,9.0,1.0,...,1.0,2.0,0.0,0.0,46.0,0.0,1.0,6.0,15.0,9.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30729,URLID_DJFD,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,...,1.0,0.0,0.0,0.0,25.0,0.0,0.0,0.0,0.0,8.0
41209,URLID_YOPA,0.0,,0.0,,0.0,2.0,1.0,3.0,0.0,...,0.0,0.0,0.0,0.0,25.0,0.0,0.0,2.0,0.0,2.0
62068,URLID_PATJ,0.0,2.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,24.0,0.0,0.0,0.0,2.0,0.0
29958,URLID_VUHJ,0.0,0.0,0.0,0.0,0.0,1.0,0.0,4.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [6]:
median_values = train.iloc[:, 1:41].median()  
test.iloc[:, 1:41] = test.iloc[:, 1:41].fillna(median_values)
print(test[numeric_cols].isnull().sum())

educat         0
histor         0
biol           0
human          0
wateranimal    0
math           0
social         0
rest           0
sleep          0
speech         0
ghost          0
data           0
process        0
business       0
customer       0
depth          0
negative       0
quantitat      0
first          0
party          0
optim          0
private        0
portal         0
game           0
conduct        0
experienc      0
visual         0
audio          0
doctor         0
engage         0
farm           0
decision       0
petrol         0
museum         0
child          0
biograph       0
soul           0
difficult      0
highest        0
skin           0
dtype: int64


In [8]:
mean_length = train['length'].mean()
median_length = train['length'].median()
std_length = train['length'].std()

print(f"Mean page length: {mean_length}")
print(f"Median page length: {median_length}")
print(f"Standard deviation of page length: {std_length}")

Mean page length: 41877.776805
Median page length: 30849.0
Standard deviation of page length: 33938.48088654548


In [9]:
import xgboost as xgb
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_absolute_error


In [10]:
X = train.iloc[:, 1:41] 
y = train['length']  

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)


In [11]:
xgb_regressor = xgb.XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=5, random_state=42)

xgb_regressor.fit(X_train, y_train)


In [12]:
length_predictions_val = xgb_regressor.predict(X_val)

mae_val = mean_absolute_error(y_val, length_predictions_val)
print("Validation MAE:", mae_val)


Validation MAE: 8586.41395180664


In [13]:
mae_scores = cross_val_score(xgb_regressor, X, y, cv=5, scoring='neg_mean_absolute_error')

mae_scores = -mae_scores
print("Cross-validated MAEs:", mae_scores)
print("Mean MAE:", np.mean(mae_scores))
print("Standard Deviation of MAE:", np.std(mae_scores))


Cross-validated MAEs: [8651.73717198 8487.97434167 8531.42207131 8485.90481439 8585.80458939]
Mean MAE: 8548.568597751466
Standard Deviation of MAE: 63.12319821760647


In [14]:
X_test = test.iloc[:, 1:41]  
predicted_lengths = xgb_regressor.predict(X_test)

predictions_df = pd.DataFrame({
    'URLID': test['URLID'],
    'Predicted Length': predicted_lengths
})
print(predictions_df.head())


             URLID  Predicted Length
156481  URLID_UMIC      19965.410156
249417  URLID_NEDA      23201.533203
44014   URLID_SYMC      10581.008789
121528  URLID_ZFXI      29280.501953
249878  URLID_VMSJ     185895.625000


In [15]:
# Random forest method
from sklearn.model_selection import cross_val_score
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = train.iloc[:, 1:41] 
y = train['length'] 
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

rf_model = RandomForestRegressor(n_estimators=100, random_state=42)  # 100 trees in the forest

rf_model.fit(X_train, y_train)

length_predictions_val = rf_model.predict(X_val)

mae_val = mean_absolute_error(y_val, length_predictions_val)
print("Validation MAE:", mae_val)

mae_scores = cross_val_score(rf_model, X, y, cv=5, scoring='neg_mean_absolute_error')

mae_scores = -mae_scores
print("Cross-validated MAEs:", mae_scores)
print("Mean MAE:", np.mean(mae_scores))
print("Standard Deviation of MAE:", np.std(mae_scores))


Validation MAE: 6189.672245210989
Cross-validated MAEs: [6166.58317092 6100.98730581 6097.44561248 6155.70324662 6135.73194188]
Mean MAE: 6131.290255542727
Standard Deviation of MAE: 28.017998550077476


In [16]:
X_test = test.iloc[:, 1:41]  

predicted_lengths_rf = rf_model.predict(X_test)

predictions_df_rf = pd.DataFrame({
    'URLID': test['URLID'],
    'Predicted Length': predicted_lengths_rf
})

print(predictions_df_rf.head())

# Save the random forest predictions (predictions_df still holds the XGBoost ones).
predictions_df_rf.to_csv('predicted_lengths.csv', index=False)


             URLID  Predicted Length
156481  URLID_UMIC      19632.075588
249417  URLID_NEDA      20163.724500
44014   URLID_SYMC      11173.781667
121528  URLID_ZFXI      27503.425000
249878  URLID_VMSJ     196869.395310


In [17]:
#Linear regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

X_train = train[numeric_cols]  
y_train = train['length']      

X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train_split, y_train_split)

y_val_pred = model.predict(X_val_split)
mae_val = mean_absolute_error(y_val_split, y_val_pred)
print(f"Validation MAE: {mae_val}")

X_test = test[numeric_cols] 

predicted_lengths = model.predict(X_test)

predictions_df = pd.DataFrame({
    'URLID': test['URLID'],
    'Predicted Length': predicted_lengths
})
print(predictions_df.head())


Validation MAE: 11013.683081670313
             URLID  Predicted Length
156481  URLID_UMIC      24063.863508
249417  URLID_NEDA      30900.188253
44014   URLID_SYMC       7774.817417
121528  URLID_ZFXI      31741.204045
249878  URLID_VMSJ     231701.124335


In [18]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

X_full_train = train.iloc[:, 1:41]  
y_full_train = train['length']     

model = LinearRegression()

neg_mae_scores = cross_val_score(model, X_full_train, y_full_train, scoring='neg_mean_absolute_error', cv=5)

mae_scores = -neg_mae_scores
mean_mae = mae_scores.mean()
std_mae = mae_scores.std()

print(f"Cross-validated MAEs: {mae_scores}")
print(f"Mean MAE: {mean_mae}")
print(f"Standard Deviation of MAE: {std_mae}")


Cross-validated MAEs: [11033.61005377 10747.2372415  10826.41944538 10867.0506322
 10897.31143663]
Mean MAE: 10874.325761894706
Standard Deviation of MAE: 94.23296829706753


In [19]:
from sklearn.model_selection import train_test_split
import xgboost as xgb

X = train.iloc[:, 1:41] 
y = train['word_present']  

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)


In [20]:
xgb_classifier = xgb.XGBClassifier(objective='binary:logistic', n_estimators=100, learning_rate=0.1, max_depth=5, random_state=42)

xgb_classifier.fit(X_train, y_train)


In [21]:
from sklearn.metrics import roc_curve, precision_recall_curve

probabilities = xgb_classifier.predict_proba(X_val)[:, 1]

fpr, tpr, thresholds = roc_curve(y_val, probabilities)

target_fpr = 0.05
closest_index = np.argmin(np.abs(fpr - target_fpr))
chosen_threshold = thresholds[closest_index]
chosen_tpr = tpr[closest_index]

print(f"Chosen Threshold: {chosen_threshold} with FPR: {fpr[closest_index]} and TPR: {chosen_tpr}")


Chosen Threshold: 0.6104010343551636 with FPR: 0.050029041626331074 and TPR: 0.520352733686067


In [22]:
test_probabilities = xgb_classifier.predict_proba(X_test)[:, 1]

test_predictions = (test_probabilities >= chosen_threshold).astype(int)

predictions_df = pd.DataFrame({
    'URLID': test['URLID'],
    'Word Present': test_predictions
})
print(predictions_df.head())


             URLID  Word Present
156481  URLID_UMIC             0
249417  URLID_NEDA             0
44014   URLID_SYMC             0
121528  URLID_ZFXI             0
249878  URLID_VMSJ             1


In [23]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc

X = train.iloc[:, 1:41]  
y = train['word_present']  

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)



In [24]:
probabilities = rf_classifier.predict_proba(X_val)[:, 1]

fpr, tpr, thresholds = roc_curve(y_val, probabilities)
roc_auc = auc(fpr, tpr)

target_fpr = 0.05
closest_index = np.argmin(np.abs(fpr - target_fpr))
chosen_threshold = thresholds[closest_index]
chosen_tpr = tpr[closest_index]

print(f"Chosen Threshold: {chosen_threshold} with FPR: {fpr[closest_index]} and TPR: {chosen_tpr}")
print(f"Area Under Curve (AUC): {roc_auc}")


Chosen Threshold: 0.5278571428571428 with FPR: 0.04995159728944821 and TPR: 0.73657848324515
Area Under Curve (AUC): 0.9231588165494586


In [25]:
test_probabilities = rf_classifier.predict_proba(X_test)[:, 1]

test_predictions_rf = (test_probabilities >= chosen_threshold).astype(int)

predictions_df_rf = pd.DataFrame({
    'URLID': test['URLID'],
    'Word Present': test_predictions_rf
})
print(predictions_df_rf.head())


             URLID  Word Present
156481  URLID_UMIC             0
249417  URLID_NEDA             0
44014   URLID_SYMC             0
121528  URLID_ZFXI             0
249878  URLID_VMSJ             1


In [26]:
from sklearn.metrics import confusion_matrix

predictions = (probabilities >= chosen_threshold).astype(int)

cm = confusion_matrix(y_val, predictions)
TP = cm[1, 1]  # True Positives
FN = cm[1, 0]  # False Negatives

TPR = TP / (TP + FN)

print(f"Estimated True Positive Rate on Validation Data: {TPR}")


Estimated True Positive Rate on Validation Data: 0.73657848324515


In [27]:
X_train = train.iloc[:, 1:41]  
y_train = train['word_present']

In [28]:
from sklearn.linear_model import LogisticRegression

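# NOTE: the model is fitted on all of X_train, which includes the rows held out
# as X_val in the next cell, so the validation TPR reported below is optimistic.
model = LogisticRegression(max_iter=1000, solver='liblinear', random_state=42)
model.fit(X_train, y_train)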


In [29]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
import numpy as np

X_train_part, X_val, y_train_part, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

probabilities = model.predict_proba(X_val)[:, 1]  

fpr, tpr, thresholds = roc_curve(y_val, probabilities)

target_fpr = 0.05
closest_index = (np.abs(fpr - target_fpr)).argmin()  
chosen_threshold = thresholds[closest_index]

print(f"Chosen threshold: {chosen_threshold} with FPR: {fpr[closest_index]} and TPR: {tpr[closest_index]}")


Chosen threshold: 0.5931473894736792 with FPR: 0.04999031945788964 and TPR: 0.42504409171075835


In [30]:
from sklearn.metrics import confusion_matrix

probabilities = model.predict_proba(X_val)[:, 1]  

predictions = (probabilities >= chosen_threshold).astype(int)

cm = confusion_matrix(y_val, predictions)

TP = cm[1, 1]  
FN = cm[1, 0]  

TPR = TP / (TP + FN)
print(f"Estimated True Positive Rate on Validation Data: {TPR}")


Estimated True Positive Rate on Validation Data: 0.42504409171075835


In [31]:
import pandas as pd

train['edited_2023'] = pd.to_datetime(train['date']).dt.year == 2023
train['edited_2023'] = train['edited_2023'].astype(int)  


In [32]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X_train = train.iloc[:, 1:41]
y_train = train['edited_2023']

X_train_part, X_val, y_train_part, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train_part, y_train_part)


In [33]:
from sklearn.metrics import roc_curve

probabilities = rf_classifier.predict_proba(X_val)[:, 1]

fpr, tpr, thresholds = roc_curve(y_val, probabilities)

target_fpr = 0.05
closest_index = np.argmin(np.abs(fpr - target_fpr))
chosen_threshold = thresholds[closest_index]
chosen_tpr = tpr[closest_index]

print(f"Chosen Threshold: {chosen_threshold} with FPR: {fpr[closest_index]} and TPR: {chosen_tpr}")


Chosen Threshold: 0.8804292929292928 with FPR: 0.048496695522665924 and TPR: 0.4716478107803261


In [34]:
test_probabilities = rf_classifier.predict_proba(X_test)[:, 1]
test_predictions_2003 = (test_probabilities >= chosen_threshold).astype(int)
test['Edited in 2023'] = test_predictions_2003
print(test.head())



             URLID  educat  histor  biol  human  wateranimal  math  social  \
156481  URLID_UMIC     0.0     0.0   0.0    0.0          0.0   0.0     0.0   
249417  URLID_NEDA     1.0     0.0   0.0    0.0          0.0   0.0     0.0   
44014   URLID_SYMC     0.0     1.0   1.0    0.0          0.0   0.0     2.0   
121528  URLID_ZFXI     0.0     1.0   0.0    0.0          0.0   0.0     0.0   
249878  URLID_VMSJ    21.0     1.0   2.0    0.0          0.0   3.0    19.0   

        rest  sleep  ...  decision  petrol  museum  child  biograph  soul  \
156481   1.0    0.0  ...       0.0     0.0     0.0   25.0       0.0   0.0   
249417   0.0    0.0  ...       0.0     0.0     0.0   25.0       0.0   0.0   
44014    0.0    0.0  ...       0.0     0.0     0.0    2.0       0.0   0.0   
121528   1.0    0.0  ...       0.0     0.0     0.0   25.0       0.0   0.0   
249878   9.0    1.0  ...       2.0     0.0     0.0   46.0       0.0   1.0   

        difficult  highest  skin  Edited in 2023  
156481        0.0

In [35]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'class_weight': [{0:1, 1:3}, {0:1, 1:5}] 
}

grid_search = GridSearchCV(estimator=RandomForestClassifier(random_state=42), param_grid=param_grid, cv=3, scoring='recall')
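# NOTE: X_train here is the full training set from In [32], which contains the
# X_val rows, so the validation figures printed below are optimistic.
grid_search.fit(X_train, y_train)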

best_rf = grid_search.best_estimator_

new_probabilities = best_rf.predict_proba(X_val)[:, 1]
new_fpr, new_tpr, new_thresholds = roc_curve(y_val, new_probabilities)
new_chosen_index = np.argmin(np.abs(new_fpr - target_fpr))
print(f"New Chosen Threshold: {new_thresholds[new_chosen_index]} with FPR: {new_fpr[new_chosen_index]} and TPR: {new_tpr[new_chosen_index]}")


New Chosen Threshold: 0.9346937533283248 with FPR: 0.04998603741971516 and TPR: 0.41425983525310184


In [36]:
import pandas as pd

train['edited_2023'] = pd.to_datetime(train['date']).dt.year == 2023
train['edited_2023'] = train['edited_2023'].astype(int)  # Convert boolean to int (0 or 1)
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(train.iloc[:, 1:41], train['edited_2023'], test_size=0.2, random_state=42)
from xgboost import XGBClassifier

xgb_classifier = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=5, random_state=42)
xgb_classifier.fit(X_train, y_train)
from sklearn.metrics import roc_curve

probabilities = xgb_classifier.predict_proba(X_val)[:, 1]
fpr, tpr, thresholds = roc_curve(y_val, probabilities)

target_fpr = 0.05
closest_index = np.argmin(np.abs(fpr - target_fpr))
chosen_threshold = thresholds[closest_index]
print(f"Chosen Threshold: {chosen_threshold} with FPR: {fpr[closest_index]} and TPR: {tpr[closest_index]}")
test_probabilities = xgb_classifier.predict_proba(X_test)[:, 1]
test_predictions = (test_probabilities >= chosen_threshold).astype(int)

predictions_df = pd.DataFrame({
    'URLID': test['URLID'],
    'Edited in 2023': test_predictions
})
print(predictions_df.head())


Chosen Threshold: 0.8846389055252075 with FPR: 0.04998603741971516 and TPR: 0.36394708958539834
             URLID  Edited in 2023
156481  URLID_UMIC               0
249417  URLID_NEDA               0
44014   URLID_SYMC               0
121528  URLID_ZFXI               0
249878  URLID_VMSJ               1


In [37]:
import pandas as pd

train['edited_2023'] = pd.to_datetime(train['date']).dt.year == 2023
train['edited_2023'] = train['edited_2023'].astype(int)  # Convert boolean to int (0 or 1)


In [38]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X_train = train.iloc[:, 1:41]
y_train = train['edited_2023']

X_train_part, X_val, y_train_part, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000, solver='liblinear', random_state=42)
model.fit(X_train_part, y_train_part)


In [39]:
from sklearn.metrics import roc_curve

probabilities = model.predict_proba(X_val)[:, 1] 

fpr, tpr, thresholds = roc_curve(y_val, probabilities)

target_fpr = 0.05
closest_index = (np.abs(fpr - target_fpr)).argmin()
chosen_threshold = thresholds[closest_index]

print(f"Chosen threshold: {chosen_threshold} with FPR: {fpr[closest_index]} and TPR: {tpr[closest_index]}")


Chosen threshold: 0.8654813451734203 with FPR: 0.04998603741971516 and TPR: 0.3139419626072393


In [40]:
test_probabilities = model.predict_proba(X_test)[:, 1]
test_predictions = (test_probabilities >= chosen_threshold).astype(int)

predictions_df = pd.DataFrame({
    'URLID': test['URLID'],
    'Edited in 2023': test_predictions
})
print(predictions_df.head())


             URLID  Edited in 2023
156481  URLID_UMIC               0
249417  URLID_NEDA               0
44014   URLID_SYMC               0
121528  URLID_ZFXI               0
249878  URLID_VMSJ               1


In [41]:
from sklearn.metrics import confusion_matrix

predictions = (probabilities >= chosen_threshold).astype(int)

cm = confusion_matrix(y_val, predictions)
TP = cm[1, 1]  
FN = cm[1, 0]  

TPR = TP / (TP + FN)

print(f"Estimated True Positive Rate on Validation Data: {TPR}")


Estimated True Positive Rate on Validation Data: 0.3139419626072393


In [42]:
import pandas as pd

predictions_df = pd.DataFrame({
    'URLID': test['URLID'],
    'Length': predicted_lengths_rf,
    'Word Present': test_predictions_rf,
    'Edited 2023': test_predictions_2003
})

predictions_df.to_csv('Final_predictions.csv', index=False)


In [43]:
summary_stats = predictions_df.describe()
print(summary_stats)

              Length  Word Present   Edited 2023
count   50000.000000  50000.000000  50000.000000
mean    42052.584034      0.296120      0.359460
std     31510.969669      0.456549      0.479847
min      5490.547405      0.000000      0.000000
25%     22622.895192      0.000000      0.000000
50%     32093.753333      0.000000      0.000000
75%     49592.956250      1.000000      1.000000
max    670894.560000      1.000000      1.000000
