In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('cleaned_dataset.csv')

In [3]:
df.shape

(4103, 15)

In [21]:
df.head(2)

Unnamed: 0,abstract,web_url,lead_paragraph,multimedia,headline,keywords,pub_date,document_type,news_desk,section_name,byline,type_of_material,_id,word_count,uri
0,Justice Rufus W. Peckham was the last Supreme ...,https://www.nytimes.com/2016/04/02/us/politics...,WASHINGTON — There are the legal giants of the...,"[{'rank': 0, 'subtype': 'watch308', 'caption':...",{'main': 'Supreme Court Fight Rescues a Justic...,"[{'name': 'organizations', 'value': 'Supreme C...",2016-04-02 00:03:48+00:00,article,National,U.S.,"{'original': 'By Carl Hulse', 'person': [{'fir...",News,nyt://article/15e78a58-d7e5-5d25-85db-72ca1241...,942,nyt://article/15e78a58-d7e5-5d25-85db-72ca1241...
1,"“All those who believe in psychokinesis, raise...",https://www.nytimes.com/2016/04/02/arts/televi...,"Ronnie Corbett, the diminutive comedian who te...","[{'rank': 0, 'subtype': 'watch308', 'caption':...","{'main': 'Ronnie Corbett, of ‘The Two Ronnies’...","[{'name': 'subject', 'value': 'Deaths (Obituar...",2016-04-02 00:08:19+00:00,article,Culture,Arts,"{'original': 'By Daniel E. Slotnik', 'person':...",Obituary (Obit),nyt://article/74200086-ce17-5db6-aeb7-7c7180d9...,539,nyt://article/74200086-ce17-5db6-aeb7-7c7180d9...


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4103 entries, 0 to 4102
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   abstract          4103 non-null   object
 1   web_url           4103 non-null   object
 2   lead_paragraph    4103 non-null   object
 3   multimedia        4103 non-null   object
 4   headline          4103 non-null   object
 5   keywords          4103 non-null   object
 6   pub_date          4103 non-null   object
 7   document_type     4103 non-null   object
 8   news_desk         4103 non-null   object
 9   section_name      4103 non-null   object
 10  byline            4103 non-null   object
 11  type_of_material  4103 non-null   object
 12  _id               4103 non-null   object
 13  word_count        4103 non-null   int64 
 14  uri               4103 non-null   object
dtypes: int64(1), object(14)
memory usage: 480.9+ KB


In [6]:
df["document_type"].value_counts()  # we may only keep the article 

document_type
article       3893
multimedia     210
Name: count, dtype: int64

## Supervised Learning: Text Classification

### Explore potential labels to be predicted

There are three features that can be used as the label: news_desk, section_name, and type_of_material.

To keep the number of labels small, I have decided to use 'type_of_material' as the label, so the model will predict the material type of each article. It has the fewest distinct values of the three (18, versus 77 for news_desk and 51 for section_name).

I have also chosen to retain only the rows whose 'document_type' is 'article', which reduces the dataset from 4103 to 3893 rows. I am considering acquiring more data to expand the dataset further.

In [11]:
df["news_desk"].nunique() 

77

In [12]:
df["section_name"].nunique() 

51

In [13]:
df["type_of_material"].nunique() 

18

In [14]:
df["type_of_material"].value_counts() 

type_of_material
News                   2956
Op-Ed                   292
Review                  177
Interactive Feature     143
briefing                 91
Quote                    78
Obituary (Obit)          71
Correction               66
Editorial                57
Slideshow                44
Schedule                 39
Letter                   31
Video                    23
News Analysis            15
List                      8
Question                  6
Brief                     5
Newsletter                1
Name: count, dtype: int64

In [15]:
df_media = df[df["document_type"] == "multimedia"]

In [16]:
df_media["type_of_material"].value_counts()

type_of_material
Interactive Feature    143
Slideshow               44
Video                   23
Name: count, dtype: int64

In [17]:
df = df[df["document_type"] == 'article'] #keep only articles

In [18]:
df.shape

(3893, 15)

In [19]:
df["type_of_material"].value_counts() #the distribution of labels shows the imbalance of the dataset. 

type_of_material
News               2956
Op-Ed               292
Review              177
briefing             91
Quote                78
Obituary (Obit)      71
Correction           66
Editorial            57
Schedule             39
Letter               31
News Analysis        15
List                  8
Question              6
Brief                 5
Newsletter            1
Name: count, dtype: int64

The distribution of labels shows the dataset is quite imbalanced, with "News" being the dominant class (2956 of 3893 articles, roughly 76%). To reduce the bias this imbalance can introduce, I can retrieve more data to increase the representation of the minority classes.

Alternatively, there are several other ways to address imbalanced classes. For example, we can combine all of the classes except "News" into a single "non_news" class (937 articles), which turns the task into binary classification.
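The "News" versus "non_news" grouping described above could be sketched as follows. This is only a minimal illustration on a toy DataFrame; the names `toy`, `is_news`, and `binary_label` are my own placeholders and do not exist in the dataset.

```python
import pandas as pd

# Toy stand-in for the article DataFrame (values are illustrative only).
toy = pd.DataFrame(
    {"type_of_material": ["News", "Op-Ed", "News", "Review", "Obituary (Obit)"]}
)

# 1 for "News", 0 for every other material type.
toy["is_news"] = (toy["type_of_material"] == "News").astype(int)

# Human-readable version of the same binary label.
toy["binary_label"] = toy["is_news"].map({1: "News", 0: "non_news"})

print(toy["binary_label"].value_counts())
```

The binary classes are still imbalanced after this grouping, so the evaluation should not rely on accuracy alone.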


### Explore the potential predictors


I print out the contents of "abstract", "lead_paragraph", "headline", and "keywords" for the second row (the second article).

I want to determine whether to use just one of these fields or a combination of all of them.

To better understand whether these fields can be used to predict the article's material type, I also read the entire second article to see how closely they relate to its overall topic.

In [7]:
print(df.loc[1, 'abstract'])

“All those who believe in psychokinesis, raise my right hand,” Mr. Corbett would say in his routine with Ronnie Barker.


In [8]:
print(df.loc[1, 'lead_paragraph'])

Ronnie Corbett, the diminutive comedian who teamed with Ronnie Barker to delight audiences for almost two decades on the hit BBC comedy show “The Two Ronnies,” died on Thursday. He was 85.


In [9]:
print(df.loc[1, 'headline'])

{'main': 'Ronnie Corbett, of ‘The Two Ronnies’ British Comedy Team, Dies at 85', 'kicker': None, 'content_kicker': None, 'print_headline': 'Ronnie Corbett, 85, Oneof Britain’s ‘Two Ronnies’', 'name': None, 'seo': None, 'sub': None}


In [10]:
print(df.loc[1, 'keywords'])

[{'name': 'subject', 'value': 'Deaths (Obituaries)', 'rank': 1, 'major': 'N'}, {'name': 'subject', 'value': 'Television', 'rank': 3, 'major': 'N'}, {'name': 'subject', 'value': 'Comedy and Humor', 'rank': 4, 'major': 'N'}, {'name': 'creative_works', 'value': 'The Two Ronnies (TV Program)', 'rank': 5, 'major': 'N'}, {'name': 'organizations', 'value': 'British Broadcasting Corp', 'rank': 6, 'major': 'N'}, {'name': 'persons', 'value': 'Barker, Ronnie (1929-2005)', 'rank': 7, 'major': 'N'}, {'name': 'persons', 'value': 'Corbett, Ronnie (1930-2016)', 'rank': 8, 'major': 'N'}]


In [22]:
print(df.loc[1, 'web_url']) #to examine the entire article. 

https://www.nytimes.com/2016/04/02/arts/television/ronnie-corbett-one-of-the-two-ronnies-dies-at-85.html


Here are my findings after reading the contents of the four features and the full text of the article:

This article's material type is marked as "Obituary (Obit)".

Abstract: It reflects the article's focus on a personality, but it may not always clearly indicate the material type. Here it is a quote from Mr. Corbett's routine and gives no hint that the article is an obituary.

Lead paragraph: This feature is more descriptive (here it states that Mr. Corbett died), making it a strong candidate for identifying the material type.

Headline: Headlines are designed to give a quick summary of the article's content. In this case, it effectively indicates that the article is an obituary.

Keywords: Keywords are very indicative of the content and context. The presence of "Deaths (Obituaries)" is a strong predictor for the obituary category.

Overall context: Reading the entire article confirms that, at least for this example, these features are closely related to the overall topic and are good predictors of the material type.

Conclusion: We will combine these four features into a single text feature to use as the predictor. We can also experiment with different combinations (for example, all four features versus only two of them) to see which yields the best classification results.
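One way to combine the four fields into a single text feature is sketched below. It assumes that 'headline' and 'keywords' are stored in the CSV as strings of Python literals (as the printed values suggest), so they are parsed with `ast.literal_eval`. The helper names and the shortened toy row are hypothetical, and using only the 'main' headline is my own simplification.

```python
import ast
import pandas as pd

def _parse(raw):
    """Parse a string holding a Python literal; pass other objects through."""
    return ast.literal_eval(raw) if isinstance(raw, str) else raw

def headline_text(raw):
    """Return the 'main' headline, or an empty string if it is missing."""
    return _parse(raw).get("main") or ""

def keyword_text(raw):
    """Join the 'value' field of every keyword entry."""
    return " ".join(kw["value"] for kw in _parse(raw))

def combine_text(row):
    """Concatenate abstract, lead paragraph, headline, and keywords."""
    parts = [
        row["abstract"],
        row["lead_paragraph"],
        headline_text(row["headline"]),
        keyword_text(row["keywords"]),
    ]
    return " ".join(p for p in parts if p)

# Shortened, made-up row in the same shape as the dataset.
toy = pd.DataFrame({
    "abstract": ["Abstract here."],
    "lead_paragraph": ["Lead here."],
    "headline": ["{'main': 'Ronnie Corbett Dies at 85', 'kicker': None}"],
    "keywords": ["[{'name': 'subject', 'value': 'Deaths (Obituaries)', 'rank': 1}, "
                 "{'name': 'subject', 'value': 'Television', 'rank': 3}]"],
})
toy["text"] = toy.apply(combine_text, axis=1)
print(toy.loc[0, "text"])
```

Dropping a field from `parts` is enough to try the smaller feature combinations mentioned above.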




## Unsupervised Learning: Topic Modeling

Goal: uncover hidden thematic structures in the data


For topic modeling, we will use the same four textual features as in text classification, but we won't need the label feature. As with text classification, we will also experiment with different combinations of the four features to see which yields the most coherent and distinct topics.

### Up to this point, we have selected our features for both supervised and unsupervised learning tasks.

## Feature engineering

These are the steps used to get from the raw dataset to the final features. They include data preprocessing and feature extraction (NLP techniques such as TF-IDF).


The preprocessing steps are similar for both tasks (text classification and topic modeling):

lowercasing, removing special characters, tokenization, removing stop words, and possibly stemming or lemmatization.
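The preprocessing steps listed above, followed by TF-IDF extraction, could be sketched like this. It is a minimal version under my own assumptions: it uses scikit-learn's built-in English stop-word list, splits on whitespace for tokenization, drops digits along with special characters, and leaves out stemming and lemmatization. The function name `preprocess` and the toy documents are hypothetical.

```python
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

def preprocess(text):
    """Lowercase, strip non-letters, tokenize on whitespace, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # also removes digits
    tokens = text.split()
    return " ".join(t for t in tokens if t not in ENGLISH_STOP_WORDS)

# Made-up documents standing in for the combined article text.
docs = [
    "The comedian DIED on Thursday, aged 85!",
    "Supreme Court fight rescues a justice from obscurity.",
]
cleaned = [preprocess(d) for d in docs]

# Turn the cleaned text into a sparse TF-IDF matrix (rows = documents).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(cleaned)

print(cleaned)
print(X.shape)
```

For the classification task, the vectorizer should be fitted on the training split only and then applied to the test split, so that no vocabulary or document-frequency information leaks from the test data.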

## Project
name/title, goals, motivation, closest example

### Title
I am thinking of two titles:

1. Dual Perspectives: Classifying and Theming NYTimes Articles.

2. The Article Analyst: Machine Learning Insights into NYTimes Journalism.




### Goals
Text Classification (Supervised Learning):
Develop a binary text classification model to categorize articles from The New York Times into two classes: "News" and "Non-News".

Topic Modeling (Unsupervised Learning):
Utilize topic modeling techniques to uncover and analyze the latent thematic structure in the dataset of articles from The New York Times.

General Project Objective

The overarching objective of this project is to leverage machine learning techniques to extract meaningful insights from a dataset of articles from The New York Times. This involves categorizing articles into relevant classes and discovering underlying themes, thereby enabling a deeper understanding of the content and structure of the news articles.

### Motivation

Our motivation for undertaking this project is to harness the power of machine learning and natural language processing to deepen our understanding of news content. By classifying and analyzing New York Times articles, we aim to improve the accessibility and navigation of news media, uncovering underlying themes and patterns in journalistic content. This project represents a convergence of technological innovation and media analysis, offering valuable insights for both readers and industry professionals, and contributing to the evolving landscape of data-driven journalism.

### Closest Example

We are asked to find the closest example to what we are proposing (as stated in the proposal instructions).

The closest example is the text classification assignment from course 655 (NLP), which uses English Wikipedia articles: given the text of a person's Wikipedia biography, determine the person's nationality.

## Next Steps

Up to this point, I have not yet considered the supervised and unsupervised learning approaches themselves, which include model selection, evaluation, and visualization methods. We can only settle these aspects after deciding on our specific supervised and unsupervised tasks.

The next immediate step I will take is to retrieve more data and prepare it in the format of our current dataset, just in case we need additional data in the future.