# Read the News Analysis

Newspapers and their online formats supply the public with the information we need to understand the events occurring in the world around us. From politics to sports, the news keeps us informed, in the loop, and ready to make decisions about how to act in a rapidly changing world.

Given the vast amount of news articles in circulation, identifying and organizing articles by topic is a useful activity. This can help us sift through the enormous amount of information out there so we can find the news relevant to our interests, or even allow us to build a news recommendation engine!

[The News International](https://www.thenews.com.pk/) is the largest English language newspaper in Pakistan, covering local and international news across a variety of sectors. A selection of articles from a [Kaggle Dataset of The News International articles](https://www.kaggle.com/asad1m9a9h6mood/news-articles) is provided in the workspace.

In this project we will use term frequency-inverse document frequency (tf-idf) to analyze each article’s content and uncover the terms that best describe each article, providing quick insight into each article’s topic.

Let’s get started!

## Imports and Data Preparation

In order to calculate tf-idf scores for the articles in the news dataset, we need to import `CountVectorizer`, `TfidfTransformer`, and `TfidfVectorizer` from `sklearn.feature_extraction.text`.

In [1]:
import pandas as pd
import numpy as np
from articles import articles
from preprocessing import preprocess_text
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

Provided in `articles.py` is a selection of 10 articles from The News International. Each article, stored as a string, is given as a corpus in the list articles.

Let's print one of the articles and read its contents.

In [2]:
# view article
print(articles[2])

KARACHI: Wholesale market rates for sugar dropped to less than Rs 50 per kg following the resumption of sugar cane crushing by sugar mills in Sindh. Within two days, the rate dropped by Rs 1.70 to Rs 49.80 per kg in Karachi Whole Sale Market. According to dealers, the resumption of sugar cane crushing by the mills stabilised the supply to the market with an immediate effect on price as well. Industry experts said that the quality of sugar cane is excellent in Sindh and approximately 100 kg of sugar cane can produce 11 kg of sugar.


Before proceeding, let’s preprocess each article by performing _tokenization_ and _lemmatization_.

We'll use a function `preprocess_text()` (imported from `preprocessing.py`) that accepts a string as input and returns a preprocessed string.

In [3]:
# preprocess each article in articles and store the processed articles in a list called processed_articles
processed_articles = [preprocess_text(article) for article in articles]

# print out one of the preprocessed articles
print(processed_articles[3])

islamabad long queue of vehicle on fuel station be visible in different part of the country a the petrol become rare commodity on thursday federal minister for petroleum shahid khaqan abbasi say it may take up to ten day to bring the situation to normality he claim that northern area of pakistan have be face the petrol shortage the minister cite the recent decline in petroleum price and delay in a shipment a reason for the shortage he say situation would improve a soon a shipment reach pakistan source tell geo news hat due to financial restraint the pakistan state oil have be unable import petrol


## Calculate Tf-idf Scores

We want to begin analysis by starting off with simple word counts for each article, so we initialize a CountVectorizer object assigned to a variable named `vectorizer`.

Then, we fit and transform our vectorizer on `processed_articles` to get the word counts for each article. The resulting counts are saved to a variable named `counts`.

After we saved the word counts to counts, we can see a DataFrame with the word counts for each article.

In [4]:
# initialize and fit CountVectorizer
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(processed_articles)

# get vocabulary of terms
try:
  feature_names = vectorizer.get_feature_names()
except:
  pass

# get article index
try:
  article_index = [f"Article {i+1}" for i in range(len(articles))]
except:
  pass

# create pandas DataFrame with word counts
try:
  df_word_counts = pd.DataFrame(counts.T.todense(), index=feature_names, columns=article_index)
  print(df_word_counts)
except:
  pass

        Article 1  Article 2  Article 3  Article 4  Article 5  Article 6  \
abbasi          0          0          0          1          0          0   
abide           1          0          0          0          0          0   
about           0          0          0          0          0          0   
accord          0          0          1          0          0          0   
add             1          0          0          0          0          0   
...           ...        ...        ...        ...        ...        ...   
world           0          0          0          0          0          3   
would           0          0          0          1          0          0   
year            0          1          0          0          0          0   
yi              0          0          0          0          0          0   
yuan            0          0          0          0          0          0   

        Article 7  Article 8  Article 9  Article 10  
abbasi          0          0     

Now that we have the word counts for each article, let’s convert them into tf-idf scores.

We need to initialize a `TfidfTransformer` object with keyword argument `norm=None` and save it to a variable `transformer`.

Then, we need to fit and transform our transformer on counts to convert the word counts into tf-idf scores for each article. The resulting tf-idf scores are saved to a variable named `tfidf_scores_transformed`.

After that, we can see another DataFrame that contains the tf-idf scores for each article.

In [5]:
# convert counts to tf-idf
transformer = TfidfTransformer(norm=None)
tfidf_scores_transformed = transformer.fit_transform(counts)

# create pandas DataFrame(s) with tf-idf scores
try:
  df_tf_idf = pd.DataFrame(tfidf_scores_transformed.T.todense(), index=feature_names, columns=article_index)
  print(df_tf_idf)
except:
  pass

        Article 1  Article 2  Article 3  Article 4  Article 5  Article 6  \
abbasi   0.000000   0.000000   0.000000   2.704748        0.0   0.000000   
abide    2.704748   0.000000   0.000000   0.000000        0.0   0.000000   
about    0.000000   0.000000   0.000000   0.000000        0.0   0.000000   
accord   0.000000   0.000000   2.704748   0.000000        0.0   0.000000   
add      2.299283   0.000000   0.000000   0.000000        0.0   0.000000   
...           ...        ...        ...        ...        ...        ...   
world    0.000000   0.000000   0.000000   0.000000        0.0   8.114244   
would    0.000000   0.000000   0.000000   2.299283        0.0   0.000000   
year     0.000000   2.704748   0.000000   0.000000        0.0   0.000000   
yi       0.000000   0.000000   0.000000   0.000000        0.0   0.000000   
yuan     0.000000   0.000000   0.000000   0.000000        0.0   0.000000   

        Article 7  Article 8  Article 9  Article 10  
abbasi   0.000000        0.0   0.

Amazing! Now ve have tf-idf scores for each article. But we want to confirm, however, that the TfidfTransformer gives the same results as directly using the TfidfVectorizer.

To do that, we start with initializing a `TfidfVectorizer` object with keyword argument `norm=None` saved to a variable `vectorizer`.

Then, we fit and transform the vectorizer on `processed_articles` to calculate the tf-idf scores for each article in one step. The resulting tf-idf scores are saved to a variable named `tfidf_scores`.

Now we can see another DataFrame appear in the browser component. Do the tf-idf scores given by TfidfVectorizer look the same as those given by TfidfTransformer?

In [6]:
# initialize and fit TfidfVectorizer
vectorizer = TfidfVectorizer(norm=None)
tfidf_scores = vectorizer.fit_transform(processed_articles)

try:
  df_tf_idf = pd.DataFrame(tfidf_scores.T.todense(), index=feature_names, columns=article_index)
  print(df_tf_idf)
except:
  pass

        Article 1  Article 2  Article 3  Article 4  Article 5  Article 6  \
abbasi   0.000000   0.000000   0.000000   2.704748        0.0   0.000000   
abide    2.704748   0.000000   0.000000   0.000000        0.0   0.000000   
about    0.000000   0.000000   0.000000   0.000000        0.0   0.000000   
accord   0.000000   0.000000   2.704748   0.000000        0.0   0.000000   
add      2.299283   0.000000   0.000000   0.000000        0.0   0.000000   
...           ...        ...        ...        ...        ...        ...   
world    0.000000   0.000000   0.000000   0.000000        0.0   8.114244   
would    0.000000   0.000000   0.000000   2.299283        0.0   0.000000   
year     0.000000   2.704748   0.000000   0.000000        0.0   0.000000   
yi       0.000000   0.000000   0.000000   0.000000        0.0   0.000000   
yuan     0.000000   0.000000   0.000000   0.000000        0.0   0.000000   

        Article 7  Article 8  Article 9  Article 10  
abbasi   0.000000        0.0   0.

Let’s confirm that the tf-idf scores given by TfidfTransformer and TfidfVectorizer are the same.

In [7]:
# check if tf-idf scores are equal
if np.allclose(tfidf_scores_transformed.todense(), tfidf_scores.todense()):
  print(pd.DataFrame({'Are the tf-idf scores the same?': ['YES']}))
else:
  print(pd.DataFrame({'Are the tf-idf scores the same?': ['No, something is wrong :(']}))

  Are the tf-idf scores the same?
0                             YES


## Analyze the Results

A simple way of identifying the “topic” of a document is to label the document with its highest-scoring tf-idf term. While this is a more naive approach than others, it is a quick and easy way of getting insight into the topic of a document.

Let's write a for loop that iterates a variable i through the values 1 to 10.

The `Pandas` Series method `.idxmax()` is a helpful tool for returning the index of the highest value in a DataFrame column. We will use this method to find the highest scoring tf-idf term for each article.

On each pass through the for loop, we will print the index of the term with the highest tf-idf score for that article (from Article 1 to Article 10).

Compare the actual text of the articles to the selected term. Do printed terms give any insight into the topic of the respective articles?

In [8]:
# get highest scoring tf-idf term for each article
for i in range(1, 10):
  print(df_tf_idf[[f'Article {i}']].idxmax())

Article 1    fare
dtype: object
Article 2    hong
dtype: object
Article 3    sugar
dtype: object
Article 4    petrol
dtype: object
Article 5    engine
dtype: object
Article 6    australia
dtype: object
Article 7    car
dtype: object
Article 8    railway
dtype: object
Article 9    cabinet
dtype: object
