## Read My News

We will begin by downloading the "MIND-small" version of the "MIND" training dataset from [here](https://mind201910small.blob.core.windows.net/release/MINDsmall_train.zip). We will save it into our "data" folder, and then uncompress its contents.

In [27]:
import requests
from zipfile import ZipFile


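url = "https://mind201910small.blob.core.windows.net/release/MINDsmall_train.zip"
response = requests.get(url, allow_redirects=True)
response.raise_for_status()  # stop early if the download failed
# Write the archive inside a context manager so the file handle gets closed.
with open("..\\data\\MINDsmall_train.zip", "wb") as zip_out:
    zip_out.write(response.content)
with ZipFile("..\\data\\MINDsmall_train.zip", "r") as zip_file:
    zip_file.extractall("..\\data\\MINDsmall_train")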

Now let's load the news articles dataset into a Polars lazy dataframe so we can explore it.

The file is tab-separated and has no header row, so we pass "\t" as the separator and supply the column names through the schema.

In [4]:
import polars as pl


news_lf = pl.scan_csv("..\\data\\MINDsmall_train\\news.tsv",
                       separator="\t",
                       has_header=False,
                       schema={"news_id" : pl.datatypes.String,
                               "category" : pl.datatypes.String,
                               "subcategory" : pl.datatypes.String,
                               "title" : pl.datatypes.String,
                               "abstract" : pl.datatypes.String,
                               "url" : pl.datatypes.String,
                               "title_entities" : pl.datatypes.String,
                               "abstract_entities" : pl.datatypes.String},
                        ignore_errors=True)

news_lf.collect()

news_id,category,subcategory,title,abstract,url,title_entities,abstract_entities
str,str,str,str,str,str,str,str
"""N55528""","""lifestyle""","""lifestyleroyals""","""The Brands Queen Elizabeth, Pr…","""Shop the notebooks, jackets, a…","""https://assets.msn.com/labs/mi…","""[{""Label"": ""Prince Philip, Duk…","""[]"""
"""N19639""","""health""","""weightloss""","""50 Worst Habits For Belly Fat""","""These seemingly harmless habit…","""https://assets.msn.com/labs/mi…","""[{""Label"": ""Adipose tissue"", ""…","""[{""Label"": ""Adipose tissue"", ""…"
"""N61837""","""news""","""newsworld""","""The Cost of Trump's Aid Freeze…","""Lt. Ivan Molchanets peeked ove…","""https://assets.msn.com/labs/mi…","""[]""","""[{""Label"": ""Ukraine"", ""Type"": …"
"""N53526""","""health""","""voices""","""I Was An NBA Wife. Here's How …","""I felt like I was a fraud, and…","""https://assets.msn.com/labs/mi…","""[]""","""[{""Label"": ""National Basketbal…"
"""N38324""","""health""","""medical""","""How to Get Rid of Skin Tags, A…","""They seem harmless, but there'…","""https://assets.msn.com/labs/mi…","""[{""Label"": ""Skin tag"", ""Type"":…","""[{""Label"": ""Skin tag"", ""Type"":…"
…,…,…,…,…,…,…,…
"""N16909""","""weather""","""weathertopstories""","""Adapting, Learning And Soul Se…","""Woolsey Fire Anniversary: A co…","""https://assets.msn.com/labs/mi…","""[{""Label"": ""Woolsey Fire"", ""Ty…","""[{""Label"": ""Woolsey Fire"", ""Ty…"
"""N47585""","""lifestyle""","""lifestylefamily""","""Family says 13-year-old Broadw…",,"""https://assets.msn.com/labs/mi…","""[{""Label"": ""Broadway theatre"",…","""[]"""
"""N7482""","""sports""","""more_sports""","""St. Dominic soccer player trie…","""Sometimes, what happens on the…","""https://assets.msn.com/labs/mi…","""[]""","""[]"""
"""N34418""","""sports""","""soccer_epl""","""How the Sounders won MLS Cup""","""Mark, Jeremiah and Casey were …","""https://assets.msn.com/labs/mi…","""[{""Label"": ""MLS Cup"", ""Type"": …","""[]"""


Since some news articles have no abstract, let's fill those nulls with "No abstract".

In [5]:
news_lf = news_lf.with_columns(
    pl.col("abstract").fill_null("No abstract")
)

Now let's explore the behaviors training dataset.

In [6]:
behaviors_lf = pl.scan_csv("..\\data\\MINDsmall_train\\behaviors.tsv",
                            separator='\t',
                            has_header=False,
                            schema={
                                "impression_id" : pl.datatypes.Int64,
                                "user_id" : pl.datatypes.String,
                                "time" : pl.datatypes.String,
                                "history" : pl.datatypes.String,
                                "impressions" : pl.datatypes.String
                            },
                            ignore_errors=True)

behaviors_lf.collect()

impression_id,user_id,time,history,impressions
i64,str,str,str,str
1,"""U13740""","""11/11/2019 9:05:58 AM""","""N55189 N42782 N34694 N45794 N1…","""N55689-1 N35729-0"""
2,"""U91836""","""11/12/2019 6:11:30 PM""","""N31739 N6072 N63045 N23979 N35…","""N20678-0 N39317-0 N58114-0 N20…"
3,"""U73700""","""11/14/2019 7:01:48 AM""","""N10732 N25792 N7563 N21087 N41…","""N50014-0 N23877-0 N35389-0 N49…"
4,"""U34670""","""11/11/2019 5:28:05 AM""","""N45729 N2203 N871 N53880 N4137…","""N35729-0 N33632-0 N49685-1 N27…"
5,"""U8125""","""11/12/2019 4:11:21 PM""","""N10078 N56514 N14904 N33740""","""N39985-0 N36050-0 N16096-0 N84…"
…,…,…,…,…
156961,"""U21593""","""11/14/2019 10:24:05 PM""","""N7432 N58559 N1954 N43353 N143…","""N2235-0 N22975-0 N64037-0 N476…"
156962,"""U10123""","""11/13/2019 6:57:04 AM""","""N9803 N104 N24462 N57318 N5574…","""N3841-0 N61571-0 N58813-0 N282…"
156963,"""U75630""","""11/14/2019 10:58:13 AM""","""N29898 N59704 N4408 N9803 N536…","""N55913-0 N62318-0 N53515-0 N10…"
156964,"""U44625""","""11/13/2019 2:57:02 PM""","""N4118 N47297 N3164 N43295 N605…","""N6219-0 N3663-0 N31147-0 N5836…"


For now, let's ignore the fact that the third column is being stored in the dataframe as a string. If we need to later, we'll convert all of those dates into a proper Datetime data type.
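
As a minimal sketch of what that conversion would involve, here is the format string that matches the sample timestamps above, tried out with Python's standard `datetime` module. The helper name `parse_behavior_time` is hypothetical, and the format is inferred only from the rows displayed above. Whether the Polars string-to-datetime parser accepts the non-padded hour with the same directives has not been verified here.

```python
from datetime import datetime

# Assumed format, inferred from the sample rows: month/day/year followed by
# a 12-hour clock with an AM/PM marker.
TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def parse_behavior_time(raw: str) -> datetime:
    """Parse one "time" value from behaviors.tsv (hypothetical helper)."""
    return datetime.strptime(raw, TIME_FORMAT)


parsed = parse_behavior_time("11/11/2019 9:05:58 AM")
print(parsed.isoformat())  # 2019-11-11T09:05:58
```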

Let's get all the unique news categories so we can use them for the news recommendation function.

In [7]:
categories = news_lf.select(pl.col("category").unique()).collect()
categories.to_series().to_list()

['sports',
 'video',
 'travel',
 'tv',
 'news',
 'northamerica',
 'music',
 'finance',
 'autos',
 'foodanddrink',
 'kids',
 'health',
 'weather',
 'entertainment',
 'lifestyle',
 'movies']

Now, let's make a function to get a given number of random news articles by category.

First we will define the JSON object that describes the function to OpenAI.

In [8]:

NEWS_RECS_BY_CATEGORY = {
    "type": "function",
    "function": {
        "name": "get_random_news_by_category",
        "description": "Returns the provided number of news article headlines \
from a given category. This function requires at least one category to work.",
        "parameters": {
            "type": "object",
            "properties": {
                # The Python function expects an int, so ask for an integer.
                "number": {"type": "integer"},
                "category": {
                    "type": "string",
                    "enum": [
                        'sports',
                        'travel',
                        'health',
                        'news',
                        'movies',
                        'tv',
                        'entertainment',
                        'video',
                        'lifestyle',
                        'finance',
                        'kids',
                        'weather',
                        'northamerica',
                        'autos',
                        'foodanddrink',
                        'music'
                    ]
                }
            },
            "required": ["number", "category"],
        },
    },
}
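
The category list in this object is typed out by hand, so it can drift from what is in the dataset. As a sketch of one alternative, the object could be built from the list of categories we collected earlier. The helper name `build_news_by_category_tool` is hypothetical, and the sorting is only there to make the result reproducible.

```python
def build_news_by_category_tool(categories: list[str]) -> dict:
    """Build the function description from a list of category names.

    Hypothetical helper: it mirrors the hand-written object above, but fills
    the "enum" from data instead of a hard-coded list.
    """
    return {
        "type": "function",
        "function": {
            "name": "get_random_news_by_category",
            "description": "Returns the provided number of news article "
                           "headlines from a given category. This function "
                           "requires at least one category to work.",
            "parameters": {
                "type": "object",
                "properties": {
                    # Integer, since the Python function expects an int.
                    "number": {"type": "integer"},
                    "category": {
                        "type": "string",
                        "enum": sorted(categories),
                    },
                },
                "required": ["number", "category"],
            },
        },
    }


# A short stand-in list; in the notebook this would be the list of unique
# categories collected from news_lf.
tool = build_news_by_category_tool(["sports", "health", "news"])
print(tool["function"]["parameters"]["properties"]["category"]["enum"])
```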


Now let's create the actual function. The result of a function call is passed back to OpenAI as a single string, so we will concatenate the titles of all the selected news articles into one string.

In [10]:
import polars as pl


def get_random_news_by_category(news_lf: pl.LazyFrame,
                                number: int,
                                category: str) -> str:
    """ Retrieves random news articles from the given category, and returns
        their titles.

    Args:
        news_lf (LazyFrame): the Polars lazy DataFrame with the news articles.
        number (int): the number of news articles to return.
        category (str): the category of news articles to return.

    Returns:
        str: the titles of the selected news articles, joined into one string.
    """

    news_articles = []

    titles = news_lf.filter(
        pl.col("category") == category
    ).select(
        pl.col("title")
    ).collect()
    # sample() raises if asked for more rows than exist, so cap the request.
    news_by_cat = titles.sample(n=min(number, titles.height)).rows()
    for article in news_by_cat:
        news_articles.append("Title: \"" + article[0] + "\"")

    return ". ".join(news_articles)


print(get_random_news_by_category(news_lf, 3, "lifestyle"))

Title: "HISD's takeover by Texas education brass official". Title: "ASU fraternity brother who died on campus was brilliant student, family says". Title: "CNU baseball coach marries love of his life at home plate of Captains Park"


Now let's define a function that retrieves an article's abstract given its title. As before, we start with the JSON object that describes it to OpenAI.

In [None]:

NEWS_ARTICLE_ABSTRACT_BY_TITLE = {
    "type": "function",
    "function": {
        "name": "get_article_abstract_by_title",
        "description": "Retrieves the abstract of the news article with the \
provided title. This function requires a title to function correctly.",
        "parameters": {
            "type": "object",
            "properties": {
                "title": { "type": "string"},
            },
            "required": ["title"],
        },
    },
}

Now let's create the actual function.

In [11]:
def get_article_abstract_by_title(title: str) -> str:
    """ Retrieves a news article's abstract by the given title.

    Args:
        title (str): the article's title to return.

    Returns:
        str: the news article's abstract.
    """

    abstract = news_lf.filter(
        pl.col("title") == title
    ).select(
        pl.col("abstract")
    ).collect()

    # Materialize the column once, then return the first match if there is one.
    abstracts = abstract.to_series().to_list()
    if abstracts:
        return abstracts[0]
    return "Abstract not found."

get_article_abstract_by_title("Peanut allergy shots? A new Stanford-led study shows an antibody injection could prevent allergic reactions")

'A treatment for severe peanut allergies could come in the form of an antibody injection, according to a new Stanford-led pilot study.'
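
With both functions defined, something still has to route a function call requested by the model to the matching Python function. Below is a minimal sketch of that routing step, assuming the call arrives as a function name plus a JSON-encoded string of arguments. The names `dispatch_tool_call`, `registry` and `fake_abstract_lookup` are hypothetical, and the lookup function is a stand-in so the sketch runs without the dataset.

```python
import json
from typing import Callable


def dispatch_tool_call(name: str,
                       arguments_json: str,
                       registry: dict[str, Callable[..., str]]) -> str:
    """Run the registered function `name` with JSON-encoded arguments.

    Always returns a string, so the result can be sent back as-is.
    """
    if name not in registry:
        return f"Unknown function: {name}"
    try:
        kwargs = json.loads(arguments_json)
    except json.JSONDecodeError:
        return "Arguments were not valid JSON."
    return registry[name](**kwargs)


def fake_abstract_lookup(title: str) -> str:
    """Stand-in for get_article_abstract_by_title (no dataset needed)."""
    return f"(abstract for {title!r})"


registry = {"get_article_abstract_by_title": fake_abstract_lookup}

result = dispatch_tool_call(
    "get_article_abstract_by_title",
    '{"title": "How the Sounders won MLS Cup"}',
    registry,
)
print(result)
```

In the notebook, the registry would map to the real functions instead. Since `get_random_news_by_category` also takes `news_lf`, it would need that argument bound first, for example with `functools.partial`.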