# MIND Recommender Challenge
The MIcrosoft News Dataset (MIND) is a large-scale dataset for news recommendation research, collected from anonymized behavior logs of the Microsoft News website. It is intended to serve as a benchmark for news recommendation and to facilitate research on news recommendation and recommender systems.

MIND contains about 160k English news articles and more than 15 million impression logs generated by 1 million users. Every news article has rich textual content, including title, abstract, body, category, and entities. Each impression log contains the clicked events, the non-clicked events, and the user's historical news click behaviors before that impression. To protect user privacy, each user was de-linked from the production system when securely hashed into an anonymized ID.
* Read about the dataset: https://github.com/msnews/msnews.github.io/blob/master/assets/doc/introduction.md
* Download the dataset: https://msnews.github.io/
* Colab version of this notebook: https://colab.research.google.com/drive/16e8iKz2b3pr7og2m651TSQ40ugHfLR5f?usp=sharing
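
As a small illustration of the impression-log layout described above, the sketch below splits one impression string into (news ID, clicked) pairs. The sample string is one of the `Impressions` values shown in the training data further down; the helper name `parse_impressions` is an assumption made for this example and is not part of the dataset tooling.

```python
def parse_impressions(impressions):
    """Split an impression string into (news_id, clicked) pairs."""
    pairs = []
    for item in impressions.split():
        # Each item looks like "<news id>-<label>"; label 1 means the news was clicked.
        news_id, label = item.rsplit("-", 1)
        pairs.append((news_id, label == "1"))
    return pairs


sample = "N103852-0 N53474-0 N127836-0 N47925-1"
pairs = parse_impressions(sample)
clicked = [news_id for news_id, was_clicked in pairs if was_clicked]
print(pairs)
print(clicked)
```

Keeping the label as a boolean makes it easy to separate clicked from non-clicked candidates later on.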

## Data Exploration

In [1]:
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

In [17]:
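def load_behaviors(path, ct_click=True):
    """Load behaviors.tsv from `path` and add per-impression count columns."""
    behaviors = pd.read_csv(path+'behaviors.tsv', 
                              delimiter='\t', 
                              header=None, 
                              names=["Impression ID", "User ID", "Time",
                                     "History", "Impressions"])
    # Number of news IDs in the user's click history (0 when the history is missing).
    behaviors["History_clicks_length"] = behaviors.History.map(lambda x: len(str(x).split()) if not pd.isna(x) else 0)
    # Number of candidate news items shown in this impression.
    behaviors["Impression_ct"] = behaviors.Impressions.map(lambda x: len(str(x).split()) if not pd.isna(x) else 0)
    # Date part of the timestamp, e.g. "11/10/2019".
    behaviors["Date"] = behaviors.Time.map(lambda x: str(x).split()[0] if not pd.isna(x) else 0)
    if ct_click:
        # Each impression item is "<news id>-<label>"; summing the labels counts the clicks.
        behaviors["Clicks_ct"] = behaviors.Impressions.map(lambda x: sum([int(itm.split("-")[1]) for itm in str(x).split()]) if not pd.isna(x) else 0)
    return behaviors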

def output_Smry(behaviors, ct_click=True):
    """Print sample, user, history, impression and (optionally) click statistics."""
    print(f"Total Sample Count is {behaviors.shape[0]}")
    print(f"Total distinct users is {len(behaviors['User ID'].unique())}")
    print(f"Missing Value Sample Count in User History is {behaviors.History.isna().sum()}({behaviors.History.isna().sum() * 100 / behaviors.shape[0]:.2f}%)")
    print(f"Max History length is {behaviors.History_clicks_length.max()} and minimal length is {behaviors[behaviors.History_clicks_length>0].History_clicks_length.min()} (exclude nas)")
    print(f"Missing Value Sample Count in Impressions {behaviors.Impressions.isna().sum()}")
    print(f"Total Impressions is {behaviors.Impression_ct.sum()}")
    if ct_click:
        print(f"Total Clicks is {behaviors.Clicks_ct.sum()}")
        print(f"Overall CTR is {behaviors.Clicks_ct.sum() *100 / behaviors.Impression_ct.sum():.4f}%")
    print(f"Time range is:")
    print(behaviors.Date.value_counts())
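
The per-row lambdas in `load_behaviors` could also be written with pandas string methods. The sketch below is one possible vectorized variant, run on two made-up rows in the `behaviors.tsv` layout; the toy data and the regular expression used to count clicks are assumptions for illustration, not the notebook's own method.

```python
import io
import pandas as pd

# Two made-up rows in the behaviors.tsv layout (tab-separated, no header).
toy_tsv = (
    "1\tU1\t11/10/2019 11:30:54 AM\tN1 N2 N3\tN10-0 N11-1 N12-0\n"
    "2\tU2\t11/11/2019 1:45:29 PM\t\tN20-1 N21-1\n"
)
toy = pd.read_csv(io.StringIO(toy_tsv), sep="\t", header=None,
                  names=["Impression ID", "User ID", "Time", "History", "Impressions"])

# Missing histories are read as NaN; treat them as empty strings before counting.
toy["History_clicks_length"] = toy["History"].fillna("").str.split().str.len()
toy["Impression_ct"] = toy["Impressions"].str.split().str.len()
# A clicked item ends with "-1", so count the items carrying that suffix.
toy["Clicks_ct"] = toy["Impressions"].str.count(r"-1(?:\s|$)")
toy["Date"] = toy["Time"].str.split().str[0]
print(toy[["History_clicks_length", "Impression_ct", "Clicks_ct", "Date"]])
```

On millions of rows the string methods tend to be easier to read than nested lambdas, though whether they are faster here has not been measured.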

In [8]:
# Local folder that holds the extracted MIND archives; adjust to your own setup.
path = "/Users/rain/Downloads/"
data_path_train = path + "MINDlarge_train/"
behaviors_train = load_behaviors(data_path_train)
behaviors_train.head()

| | Impression ID | User ID | Time | History | Impressions | History_clicks_length | Impression_ct | Date | Clicks_ct |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | U87243 | 11/10/2019 11:30:54 AM | N8668 N39081 N65259 N79529 N73408 N43615 N2937... | N78206-0 N26368-0 N7578-0 N58592-0 N19858-0 N5... | 16 | 19 | 11/10/2019 | 4 |
| 1 | 2 | U598644 | 11/12/2019 1:45:29 PM | N56056 N8726 N70353 N67998 N83823 N111108 N107... | N47996-0 N82719-0 N117066-0 N8491-0 N123784-0 ... | 24 | 29 | 11/12/2019 | 2 |
| 2 | 3 | U532401 | 11/13/2019 11:23:03 AM | N128643 N87446 N122948 N9375 N82348 N129412 N5... | N103852-0 N53474-0 N127836-0 N47925-1 | 16 | 4 | 11/13/2019 | 1 |
| 3 | 4 | U593596 | 11/12/2019 12:24:09 PM | N31043 N39592 N4104 N8223 N114581 N92747 N1207... | N38902-0 N76434-0 N71593-0 N100073-0 N108736-0... | 13 | 52 | 11/12/2019 | 1 |
| 4 | 5 | U239687 | 11/14/2019 8:03:01 PM | N65250 N122359 N71723 N53796 N41663 N41484 N11... | N76209-0 N48841-0 N67937-0 N62235-0 N6307-0 N3... | 339 | 129 | 11/14/2019 | 1 |


In [18]:
print("Summary of train data:")
output_Smry(behaviors_train)

Summary of train data:
Total Sample Count is 2232748
Total distinct users is 711222
Missing Value Sample Count in User History is 46065(2.06%)
Max History length is 801 and minimal length is 1 (exclude nas)
Missing Value Sample Count in Impressions 0
Total Impressions is 83507374
Total Clicks is 3383656
Overall CTR is 4.0519%
Time range is:
11/12/2019    478375
11/11/2019    464467
11/13/2019    453494
11/14/2019    431517
11/10/2019    212343
11/9/2019     192552
Name: Date, dtype: int64
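
The summary reports a single overall CTR (3,383,656 clicks over 83,507,374 impressions, or 4.0519%). A per-day breakdown could be obtained by grouping on `Date`. The sketch below shows the idea on made-up rows that reuse the column names added by `load_behaviors`; the numbers in the toy frame are invented.

```python
import pandas as pd

# Made-up rows that mimic the columns added by load_behaviors.
toy = pd.DataFrame({
    "Date": ["11/9/2019", "11/9/2019", "11/10/2019"],
    "Impression_ct": [10, 30, 20],
    "Clicks_ct": [1, 1, 2],
})

# Sum clicks and impressions per day first, then divide, so that each day's CTR
# is weighted by its number of impressions rather than averaged per row.
daily = toy.groupby("Date")[["Clicks_ct", "Impression_ct"]].sum()
daily["CTR_pct"] = daily["Clicks_ct"] * 100 / daily["Impression_ct"]
print(daily)
```

Applied to `behaviors_train`, the same two lines would show whether the click-through rate differs between the weekend days (11/9 and 11/10) and the weekdays.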


In [6]:
behaviors_train.Time[0].split()[0]

'11/10/2019'

In [19]:
data_path_dev = path + "MINDlarge_dev/"
behaviors_dev = load_behaviors(data_path_dev)
print("Summary of dev data:")
output_Smry(behaviors_dev)

Summary of dev data:
Total Sample Count is 376471
Total distinct users is 255990
Missing Value Sample Count in User History is 11270(2.99%)
Max History length is 801 and minimal length is 1 (exclude nas)
Missing Value Sample Count in Impressions 0
Total Impressions is 14085557
Total Clicks is 574845
Overall CTR is 4.0811%
Time range is:
11/15/2019    376471
Name: Date, dtype: int64


In [20]:
train = pd.concat([behaviors_dev, behaviors_train])
print("Summary of train and dev data:")
output_Smry(train)

Summary of train and dev data:
Total Sample Count is 2609219
Total distinct users is 750434
Missing Value Sample Count in User History is 57335(2.20%)
Max History length is 801 and minimal length is 1 (exclude nas)
Missing Value Sample Count in Impressions 0
Total Impressions is 97592931
Total Clicks is 3958501
Overall CTR is 4.0561%
Time range is:
11/12/2019    478375
11/11/2019    464467
11/13/2019    453494
11/14/2019    431517
11/15/2019    376471
11/10/2019    212343
11/9/2019     192552
Name: Date, dtype: int64
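
The three summaries can be cross-checked against each other. The sketch below copies the printed totals into plain dictionaries (the dictionary names are assumptions for this example) and uses inclusion-exclusion to estimate how many users appear in both the train and dev splits.

```python
# Figures copied from the train, dev and combined summaries printed above.
train_stats = {"samples": 2232748, "users": 711222, "impressions": 83507374, "clicks": 3383656}
dev_stats = {"samples": 376471, "users": 255990, "impressions": 14085557, "clicks": 574845}
combined = {"samples": 2609219, "users": 750434, "impressions": 97592931, "clicks": 3958501}

# Row-level totals simply add up after concatenation.
for key in ("samples", "impressions", "clicks"):
    assert train_stats[key] + dev_stats[key] == combined[key]

# Distinct users do not add up, because some users appear in both splits.
# Inclusion-exclusion: |train and dev| = |train| + |dev| - |train or dev|.
shared_users = train_stats["users"] + dev_stats["users"] - combined["users"]
print(f"Users present in both train and dev: {shared_users}")
print(f"Share of dev users already seen in train: {shared_users * 100 / dev_stats['users']:.2f}%")
```

With the reported figures, 216,778 users occur in both splits, so most dev users already have behavior logs in the training week.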


In [21]:
data_path_test = path + "MINDlarge_test/"
# The test impressions carry no click labels, so click counting is skipped.
behaviors_test = load_behaviors(data_path_test, ct_click=False)
print("Summary of test data:")
output_Smry(behaviors_test, ct_click=False)

Summary of test data:
Total Sample Count is 2370727
Total distinct users is 702005
Missing Value Sample Count in User History is 29108(1.23%)
Max History length is 1021 and minimal length is 1 (exclude nas)
Missing Value Sample Count in Impressions 0
Total Impressions is 93115001
Time range is:
11/18/2019    447628
11/20/2019    439238
11/21/2019    420965
11/19/2019    412708
11/22/2019    397746
11/17/2019    168742
11/16/2019     83700
Name: Date, dtype: int64


In [22]:
history_users = set(train["User ID"])
# Flag test users that never appear in the train or dev behaviors.
behaviors_test["Is_new_user"]=behaviors_test["User ID"].map(lambda x: x not in history_users)
new_users_test = behaviors_test[behaviors_test.Is_new_user]
old_users_test = behaviors_test[~ behaviors_test.Is_new_user]
print("Summary of test data with users not seen in train:")
output_Smry(new_users_test, ct_click=False)
print("\n")
print("Summary of test data with users seen in train:")
output_Smry(old_users_test, ct_click=False)

Summary of test data with users not seen in train:
Total Sample Count is 285496
Total distinct users is 126522
Missing Value Sample Count in User History is 29108(10.20%)
Max History length is 424 and minimal length is 1 (exclude nas)
Missing Value Sample Count in Impressions 0
Total Impressions is 10611299
Time range is:
11/20/2019    55546
11/21/2019    52421
11/18/2019    50800
11/22/2019    50357
11/19/2019    49042
11/17/2019    17521
11/16/2019     9809
Name: Date, dtype: int64
Summary of test data with users seen in train:
Total Sample Count is 2085231
Total distinct users is 575483
Missing Value Sample Count in User History is 0(0.00%)
Max History length is 1021 and minimal length is 1 (exclude nas)
Missing Value Sample Count in Impressions 0
Total Impressions is 82503702
Time range is:
11/18/2019    396828
11/20/2019    383692
11/21/2019    368544
11/19/2019    363666
11/22/2019    347389
11/17/2019    151221
11/16/2019     73891
Name: Date, dtype: int64
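
The two groups partition the test set: 126,522 + 575,483 = 702,005 distinct users and 285,496 + 2,085,231 = 2,370,727 samples. All 29,108 test rows with a missing history fall into the unseen-user group. A vectorized way to build the same split is sketched below on made-up user IDs; `seen_users` and `test` are placeholders for `history_users` and `behaviors_test`.

```python
import pandas as pd

# Made-up user IDs; in the notebook these would come from train and behaviors_test.
seen_users = {"U1", "U2", "U3"}
test = pd.DataFrame({"User ID": ["U1", "U4", "U2", "U5", "U4"]})

# isin gives a boolean mask of users already seen; negate it to flag new users.
is_new_user = ~test["User ID"].isin(seen_users)
new_users = test[is_new_user]
old_users = test[~is_new_user]

# The two parts partition the test rows, so their sizes must add up.
assert len(new_users) + len(old_users) == len(test)
print(f"New-user rows: {len(new_users)}, distinct new users: {new_users['User ID'].nunique()}")
print(f"Share of rows from new users: {is_new_user.mean() * 100:.1f}%")
```

The mask can be reused for both subsets, which avoids storing an extra helper column on the frame.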


In [23]:
news_train = pd.read_csv(data_path_train+'news.tsv', 
                         delimiter='\t', 
                         header=None, 
                         names=["News ID", "Category", "SubCategory","Title", "Abstract", "URL","Title Entities", "Abstract Entites"])
news_train.head()

| | News ID | Category | SubCategory | Title | Abstract | URL | Title Entities | Abstract Entites |
|---|---|---|---|---|---|---|---|---|
| 0 | N88753 | lifestyle | lifestyleroyals | The Brands Queen Elizabeth, Prince Charles, an... | Shop the notebooks, jackets, and more that the... | https://assets.msn.com/labs/mind/AAGH0ET.html | [{"Label": "Prince Philip, Duke of Edinburgh",... | [] |
| 1 | N45436 | news | newsscienceandtechnology | Walmart Slashes Prices on Last-Generation iPads | Apple's new iPad releases bring big deals on l... | https://assets.msn.com/labs/mind/AABmf2I.html | [{"Label": "IPad", "Type": "J", "WikidataId": ... | [{"Label": "IPad", "Type": "J", "WikidataId": ... |
| 2 | N23144 | health | weightloss | 50 Worst Habits For Belly Fat | These seemingly harmless habits are holding yo... | https://assets.msn.com/labs/mind/AAB19MK.html | [{"Label": "Adipose tissue", "Type": "C", "Wik... | [{"Label": "Adipose tissue", "Type": "C", "Wik... |
| 3 | N86255 | health | medical | Dispose of unwanted prescription drugs during ... | | https://assets.msn.com/labs/mind/AAISxPN.html | [{"Label": "Drug Enforcement Administration", ... | [] |
| 4 | N93187 | news | newsworld | The Cost of Trump's Aid Freeze in the Trenches... | Lt. Ivan Molchanets peeked over a parapet of s... | https://assets.msn.com/labs/mind/AAJgNsz.html | [] | [{"Label": "Ukraine", "Type": "G", "WikidataId... |
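
The entity columns hold JSON-encoded lists of dictionaries with keys such as `Label`, `Type` and `WikidataId`. The sketch below shows one way to pull out the labels with the standard `json` module. Because the cells above are truncated, the example string is a hypothetical complete value in the same shape, and its `WikidataId` is a placeholder; `entity_labels` is an assumed helper name.

```python
import json

# Hypothetical, complete entity string in the same shape as the truncated cells above;
# the WikidataId value is a placeholder, not taken from the dataset.
title_entities = '[{"Label": "IPad", "Type": "J", "WikidataId": "Q0"}]'


def entity_labels(raw):
    """Return the entity labels from a JSON-encoded entity list ('[]' gives no labels)."""
    # Missing cells are read by pandas as NaN (a float), so guard against non-strings.
    if not isinstance(raw, str):
        return []
    return [entity["Label"] for entity in json.loads(raw)]


print(entity_labels(title_entities))
print(entity_labels("[]"))
print(entity_labels(float("nan")))
```

On the full table this could be applied as `news_train["Title Entities"].map(entity_labels)`.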
