## Read My News

We will begin by downloading the development split of the "MIND-small" version of the MIND dataset from [here](https://mind201910small.blob.core.windows.net/release/MINDsmall_dev.zip). We will save it into our "data" folder and then extract its contents.

In [2]:
import requests
from zipfile import ZipFile
import os


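if not os.path.isfile("../data/MINDsmall_dev/news.tsv"):
    url = "https://mind201910small.blob.core.windows.net/release/MINDsmall_dev.zip"
    # Make sure the target folder exists before writing into it
    os.makedirs("../data", exist_ok=True)
    response = requests.get(url, allow_redirects=True)
    # Fail early instead of saving an error page as the archive
    response.raise_for_status()
    # The context manager closes the archive before it is extracted
    with open("../data/MINDsmall_dev.zip", "wb") as zip_out:
        zip_out.write(response.content)
    with ZipFile("../data/MINDsmall_dev.zip", "r") as zip_file:
        zip_file.extractall("../data/MINDsmall_dev")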

Now let's load the news articles dataset into a Polars lazy dataframe so we can explore it.

The file is tab-separated and has no header row, so we will pass "\t" as the separator and supply the column names through a schema.

In [4]:
import polars as pl


news_lf = pl.scan_csv("../data/MINDsmall_dev/news.tsv",
                       separator="\t",
                       has_header=False,
                       schema={"news_id" : pl.datatypes.String,
                               "category" : pl.datatypes.String,
                               "subcategory" : pl.datatypes.String,
                               "title" : pl.datatypes.String,
                               "abstract" : pl.datatypes.String,
                               "url" : pl.datatypes.String,
                               "title_entities" : pl.datatypes.String,
                               "abstract_entities" : pl.datatypes.String},
                        ignore_errors=True)

news_lf.collect()

news_id,category,subcategory,title,abstract,url,title_entities,abstract_entities
str,str,str,str,str,str,str,str
"""N55528""","""lifestyle""","""lifestyleroyals""","""The Brands Queen Elizabeth, Pr…","""Shop the notebooks, jackets, a…","""https://assets.msn.com/labs/mi…","""[{""Label"": ""Prince Philip, Duk…","""[]"""
"""N18955""","""health""","""medical""","""Dispose of unwanted prescripti…",,"""https://assets.msn.com/labs/mi…","""[{""Label"": ""Drug Enforcement A…","""[]"""
"""N61837""","""news""","""newsworld""","""The Cost of Trump's Aid Freeze…","""Lt. Ivan Molchanets peeked ove…","""https://assets.msn.com/labs/mi…","""[]""","""[{""Label"": ""Ukraine"", ""Type"": …"
"""N53526""","""health""","""voices""","""I Was An NBA Wife. Here's How …","""I felt like I was a fraud, and…","""https://assets.msn.com/labs/mi…","""[]""","""[{""Label"": ""National Basketbal…"
"""N38324""","""health""","""medical""","""How to Get Rid of Skin Tags, A…","""They seem harmless, but there'…","""https://assets.msn.com/labs/mi…","""[{""Label"": ""Skin tag"", ""Type"":…","""[{""Label"": ""Skin tag"", ""Type"":…"
…,…,…,…,…,…,…,…
"""N63550""","""lifestyle""","""lifestyleroyals""","""Why Kate & Meghan Were on Diff…","""There's no scandal here. It's …","""https://assets.msn.com/labs/mi…","""[{""Label"": ""Meghan, Duchess of…","""[]"""
"""N30345""","""entertainment""","""entertainment-celebrity""","""See the stars at the 2019 Baby…","""Stars like Chrissy Teigen and …","""https://assets.msn.com/labs/mi…","""[]""","""[{""Label"": ""Kate Hudson"", ""Typ…"
"""N30135""","""news""","""newsgoodnews""","""Tennessee judge holds lawyer's…","""Tennessee Court of Appeals Jud…","""https://assets.msn.com/labs/mi…","""[{""Label"": ""Tennessee"", ""Type""…","""[{""Label"": ""Tennessee Court of…"
"""N44276""","""autos""","""autossports""","""Best Sports Car Deals for Octo…",,"""https://assets.msn.com/labs/mi…","""[{""Label"": ""Peugeot RCZ"", ""Typ…","""[]"""


Since some news articles have no abstract, let's fill those nulls with "No abstract".

In [5]:
news_lf = news_lf.with_columns(
    pl.col("abstract").fill_null("No abstract")
)

Now let's explore the behaviors dataset.

In [5]:
behaviors_lf = pl.scan_csv("../data/MINDsmall_dev/behaviors.tsv",
                            separator='\t',
                            has_header=False,
                            schema={
                                "impression_id" : pl.datatypes.Int64,
                                "user_id" : pl.datatypes.String,
                                "time" : pl.datatypes.String,
                                "history" : pl.datatypes.String,
                                "impressions" : pl.datatypes.String
                            },
                            ignore_errors=True)

behaviors_lf.collect()

impression_id,user_id,time,history,impressions
i64,str,str,str,str
1,"""U80234""","""11/15/2019 12:37:50 PM""","""N55189 N46039 N51741 N53234 N1…","""N28682-0 N48740-0 N31958-1 N34…"
2,"""U60458""","""11/15/2019 7:11:50 AM""","""N58715 N32109 N51180 N33438 N5…","""N20036-0 N23513-1 N32536-0 N46…"
3,"""U44190""","""11/15/2019 9:55:12 AM""","""N56253 N1150 N55189 N16233 N61…","""N36779-0 N62365-0 N58098-0 N54…"
4,"""U87380""","""11/15/2019 3:12:46 PM""","""N63554 N49153 N28678 N23232 N4…","""N6950-0 N60215-0 N6074-0 N1193…"
5,"""U9444""","""11/15/2019 8:25:46 AM""","""N51692 N18285 N26015 N22679 N5…","""N5940-1 N23513-0 N49285-0 N233…"
…,…,…,…,…
73148,"""U77536""","""11/15/2019 8:40:16 PM""","""N28691 N8845 N58434 N37120 N22…","""N496-0 N35159-0 N59856-0 N1327…"
73149,"""U56193""","""11/15/2019 1:11:26 PM""","""N4705 N58782 N53531 N46492 N26…","""N49285-0 N31958-0 N55237-0 N42…"
73150,"""U16799""","""11/15/2019 3:37:06 PM""","""N40826 N42078 N15670 N15295 N6…","""N7043-0 N512-0 N60215-1 N45057…"
73151,"""U8786""","""11/15/2019 8:29:26 AM""","""N3046 N356 N20483 N46107 N4459…","""N23692-0 N19990-0 N20187-0 N59…"


For now, let's ignore the fact that the third column is being stored in the dataframe as a string. If we need to later, we'll convert all of those dates into a proper Datetime data type.
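
As a minimal sketch of what that later conversion would involve, the snippet below parses timestamps with Python's standard library. The format string is inferred from the rows displayed above, and `parse_behavior_time` is a hypothetical helper name, not part of the notebook. The same format string would likely carry over to Polars' own string-to-datetime parsing, although that is not verified here.

```python
from datetime import datetime

# Format inferred from the displayed rows, e.g. "11/15/2019 12:37:50 PM":
# month/day/year, 12-hour clock, AM/PM marker.
TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def parse_behavior_time(raw: str) -> datetime:
    """Parses one value of the "time" column into a datetime (hypothetical helper)."""
    return datetime.strptime(raw, TIME_FORMAT)


# Values copied from the "time" column shown above
morning = parse_behavior_time("11/15/2019 7:11:50 AM")
afternoon = parse_behavior_time("11/15/2019 3:12:46 PM")
print(morning.isoformat())
print(afternoon.isoformat())
```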

Let's get all the unique news categories so we can use them for the news recommendation function.

In [6]:
categories = news_lf.select(pl.col("category").unique()).collect()
categories.to_series().to_list()

['video',
 'health',
 'finance',
 'lifestyle',
 'movies',
 'sports',
 'news',
 'kids',
 'entertainment',
 'travel',
 'games',
 'foodanddrink',
 'weather',
 'autos',
 'tv',
 'music']

Now, let's make a function to get a given number of random news articles by category.

First, we will define the JSON object that describes the function to OpenAI.

In [7]:

NEWS_RECS_BY_CATEGORY = {
    "type": "function",
    "function": {
        "name": "get_random_news_by_category",
        "description": "Returns the provided number of news article headlines \
from a given category. This function requires at least one category to work.",
        "parameters": {
            "type": "object",
            "properties": {
                "number": { "type": "number"},
                "category": {
                    "type": "string",
                    "enum": [
                        'sports',
                        'travel',
                        'health',
                        'news',
                        'movies',
                        'tv',
                        'entertainment',
                        'video',
                        'lifestyle',
                        'finance',
                        'kids',
                        'weather',
                        'games',
                        'autos',
                        'foodanddrink',
                        'music'
                    ]
                }
            },
            "required": ["number", "category"],
        },
    },
}
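
Because the enum is typed by hand, it can drift from the categories actually present in the data. One way to avoid that, sketched here under the assumption that the category list is available as a plain Python list, is to build the definition from that list. `build_news_recs_tool` is a hypothetical helper name, and the short list passed to it is illustrative only.

```python
def build_news_recs_tool(categories: list[str]) -> dict:
    """Builds the tool definition with an enum taken from the given categories."""
    return {
        "type": "function",
        "function": {
            "name": "get_random_news_by_category",
            "description": "Returns the provided number of news article "
                           "headlines from a given category. This function "
                           "requires at least one category to work.",
            "parameters": {
                "type": "object",
                "properties": {
                    "number": {"type": "number"},
                    "category": {
                        "type": "string",
                        # Deduplicate and sort so the enum is stable between runs
                        "enum": sorted(set(categories)),
                    },
                },
                "required": ["number", "category"],
            },
        },
    }


# Illustrative input; in the notebook this would be the list of unique categories
example_tool = build_news_recs_tool(["news", "sports", "games", "news"])
example_enum = example_tool["function"]["parameters"]["properties"]["category"]["enum"]
print(example_enum)
```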


Now let's create the actual function. OpenAI expects a function's result to be passed back as a single string, so we will concatenate the titles of all sampled articles into one string.

In [8]:
import polars as pl


def get_random_news_by_category(news_lf: pl.LazyFrame,
                                number: int,
                                category: str) -> str:
    """ Retrieves random news articles by the given category, and returns
        their title.

    Args:
        news_lf (LazyFrame): the Polars lazy DataFrame with the news articles.
        number (int): the number of news articles by category to return.
        category (str): the category of news articles to return.

    Returns:
        str: the title of each news article.
    """

    titles = news_lf.filter(
        pl.col("category") == category
    ).select(
        pl.col("title")
    ).collect().to_series()
    # sample() raises if asked for more rows than exist, so cap the request
    sampled = titles.sample(n=min(number, titles.len()))
    news_articles = [f'Title: "{title}"' for title in sampled]

    return ". ".join(news_articles)


print(get_random_news_by_category(news_lf, 3, "lifestyle"))

Title: "33 Most Common Reasons Why Relationships Fail". Title: "Friend defends mom who left children alone in deadly fire: 'Not everybody is a perfect parent'". Title: "Newborns dressed up as Mister Rogers on National Kindness Day"
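
When the model decides to call this function, the Chat Completions API delivers the function name together with its arguments as a JSON-encoded string, and the result goes back to the model as a string. The following is a minimal sketch of that dispatch step, not the notebook's own code: `run_tool_call`, `AVAILABLE_TOOLS`, and the stand-in `fake_get_random_news_by_category` are hypothetical names, and the stand-in replaces the real function so the example runs without the dataset.

```python
import json


def fake_get_random_news_by_category(number: int, category: str) -> str:
    """Hypothetical stand-in that mimics the real function's output format."""
    return ". ".join(
        f'Title: "{category} story {i}"' for i in range(1, number + 1)
    )


# Maps the names declared in the tool definitions to Python callables
AVAILABLE_TOOLS = {
    "get_random_news_by_category": fake_get_random_news_by_category,
}


def run_tool_call(name: str, arguments_json: str) -> str:
    """Decodes the JSON arguments and calls the matching function."""
    tool = AVAILABLE_TOOLS.get(name)
    if tool is None:
        # Return a string rather than raising, so the model can see the problem
        return f"Unknown function: {name}"
    arguments = json.loads(arguments_json)
    return tool(**arguments)


result = run_tool_call(
    "get_random_news_by_category",
    '{"number": 2, "category": "sports"}',
)
print(result)
```

Returning an error string for an unknown name, instead of raising, is a design choice of this sketch: it keeps the conversation going and lets the model correct itself.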


Now let's define a function that retrieves an article's abstract given its title.

In [9]:

NEWS_ARTICLE_ABSTRACT_BY_TITLE = {
    "type": "function",
    "function": {
        "name": "get_article_abstract_by_title",
        "description": "Retrieves the news article's abstract with the \
provided title. This function requires at least one title to function \
correctly.",
        "parameters": {
            "type": "object",
            "properties": {
                "title": { "type": "string"},
            },
            "required": ["title"],
        },
    },
}

Now let's create the actual function.

In [17]:
def get_article_abstract_by_title(title: str) -> str:
    """ Retrieves a news article's abstract by the given title.

    Args:
        title (str): the article's title to return.

    Returns:
        str: the news article's abstract.
    """

    abstract = news_lf.filter(
        pl.col("title") == title
    ).select(
        pl.col("abstract")
    ).collect()
    if len(abstract.to_series().to_list()) > 0:
        return abstract.to_series().to_list()[0]
    else:
        return "Abstract not found."

get_article_abstract_by_title("Instagram Filters with Plastic Surgery-Inspired Effects Could Soon Disappear")

'In an effort to combat some of the negative mental impacts caused by social media, the company announced it is removing all of its filters that give a plastic surgery-like effect.'