## Read the News Analysis

Newspapers and their online formats supply the public with the information we need to understand the events occurring in the world around us. From politics to sports, the news keeps us informed, in the loop, and ready to make decisions about how to act in a rapidly changing world.

Given the vast amount of news articles in circulation, identifying and organizing articles by topic is a useful activity. This can help you sift through the enormous amount of information out there so you can find the news relevant to your interests, or even allow you to build a news recommendation engine!

The News International (https://www.thenews.com.pk/) is the largest English language newspaper in Pakistan, covering local and international news across a variety of sectors. A selection of articles from a Kaggle Dataset of The News International articles (https://www.kaggle.com/datasets/asad1m9a9h6mood/news-articles) is provided in the workspace.

In this project you will use term frequency-inverse document frequency (tf-idf) to analyze each article’s content and uncover the terms that best describe each article, providing quick insight into each article’s topic.

## Imports and Data Preparation

1. In order to calculate tf-idf scores for the articles in the news dataset, you will need some help from scikit-learn. Begin by importing `CountVectorizer`, `TfidfTransformer`, and `TfidfVectorizer` from `sklearn.feature_extraction.text`.


In [1]:
import pandas as pd
import numpy as np
from articles import articles
from preprocessing4 import preprocess_text

# import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

2. Provided in articles.py is a selection of 10 articles from The News International. Each article, stored as a string, is given as a corpus in the list articles.

    Print one of the articles and read its contents.


In [2]:
# view article

print(articles[3])

ISLAMABAD: Long queues of vehicles on fuel stations were visible in different parts of the country as the petrol became rare commodity on Thursday. Federal Minister for Petroleum Shahid Khaqan Abbasi says "it may take up to ten days to bring the situation to normality". He claimed that northern areas of Pakistan had been facing the petrol shortage. The minister cited the recent decline in petroleum prices and delay in a shipment as reasons for the shortage.He said situation would improve as soon as shipment reached Pakistan. Sources told Geo News hat due to financial restraints the Pakistan State Oil has been unable import petrol.


3. Before proceeding, let’s preprocess each article by performing tokenization and lemmatization.

    Provided in **preprocessing4.py** is a function `preprocess_text()` that accepts a string as input and returns a preprocessed string.

    Preprocess each article in `articles` and store the processed articles in a list called `processed_articles`.

    Print out one of the preprocessed articles.


In [3]:
# preprocess articles
processed_articles = [preprocess_text(article) for article in articles]

print(processed_articles[3])

islamabad long queue of vehicle on fuel station be visible in different part of the country a the petrol become rare commodity on thursday federal minister for petroleum shahid khaqan abbasi say it may take up to ten day to bring the situation to normality he claim that northern area of pakistan have be face the petrol shortage the minister cite the recent decline in petroleum price and delay in a shipment a reason for the shortage he say situation would improve a soon a shipment reach pakistan source tell geo news hat due to financial restraint the pakistan state oil have be unable import petrol


## Calculate Tf-idf Scores

4. You want to begin your analysis by starting off with simple word counts for each article. Initialize a `CountVectorizer` object assigned to a variable named `vectorizer`.


In [4]:
# initialize and fit CountVectorizer

vectorizer = CountVectorizer()

5. Fit and transform your `vectorizer` on `processed_articles` to get the word counts for each article. Save the resulting counts to a variable named counts.

    After you save the word counts to `counts`, you will see a DataFrame appear in the browser component. View the DataFrame to see the word counts for each article.


In [5]:
# convert counts to tf-idf

counts = vectorizer.fit_transform(processed_articles)
# print(counts)

6. Now that you have the word counts for each article, let’s convert them into tf-idf scores.

    Initialize a `TfidfTransformer` object with keyword argument `norm=None` saved to a variable `transformer`.


In [6]:
# initialize and fit TfidfVectorizer

transformer = TfidfTransformer(norm=None)

In [7]:
# get vocabulary of terms
try:
    feature_names = vectorizer.get_feature_names()
except:
    pass

# get article index
try:
    article_index = [f"Article {i+1}" for i in range(len(articles))]
except:
    pass

In [8]:
try:
    df_word_counts = pd.DataFrame(counts.T.todense(), index=feature_names, columns=article_index)
#     print(df_word_counts)
except:
    pass
df_word_counts

Unnamed: 0,Article 1,Article 2,Article 3,Article 4,Article 5,Article 6,Article 7,Article 8,Article 9,Article 10
abbasi,0,0,0,1,0,0,0,0,0,0
abide,1,0,0,0,0,0,0,0,0,0
about,0,0,0,0,0,0,1,0,0,0
accord,0,0,1,0,0,0,0,0,0,0
add,1,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...
world,0,0,0,0,0,3,0,0,0,0
would,0,0,0,1,0,0,0,0,1,0
year,0,1,0,0,0,0,0,0,0,0
yi,0,0,0,0,0,0,0,0,0,2


7. Fit and transform your `transformer` on `counts` to convert the word counts into tf-idf scores for each article. Save the resulting tf-idf scores to a variable named `tfidf_scores_transformed`.

    After you save the tf-idf scores to `tfidf_scores_transformed`, you will see another DataFrame appear in the browser component. View the DataFrame to see the tf-idf scores for each article.


In [9]:
# check if tf-idf scores are equal

tfidf_scores_transformed = transformer.fit_transform(counts)
# print(tfidf_scores_transformed)


In [10]:
# create pandas DataFrame(s) with tf-idf scores
try:
    df_tf_idf = pd.DataFrame(tfidf_scores_transformed.T.todense(), index=feature_names, columns=article_index)
#     print(df_tf_idf)
except:
    pass
df_tf_idf

Unnamed: 0,Article 1,Article 2,Article 3,Article 4,Article 5,Article 6,Article 7,Article 8,Article 9,Article 10
abbasi,0.000000,0.000000,0.000000,2.704748,0.0,0.000000,0.000000,0.0,0.000000,0.000000
abide,2.704748,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000
about,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,2.704748,0.0,0.000000,0.000000
accord,0.000000,0.000000,2.704748,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000
add,2.299283,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,2.299283,0.000000
...,...,...,...,...,...,...,...,...,...,...
world,0.000000,0.000000,0.000000,0.000000,0.0,8.114244,0.000000,0.0,0.000000,0.000000
would,0.000000,0.000000,0.000000,2.299283,0.0,0.000000,0.000000,0.0,2.299283,0.000000
year,0.000000,2.704748,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000
yi,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,5.409496


8. Amazing! Now you have your tf-idf scores for each article. You want to confirm, however, that the `TfidfTransformer` gives the same results as directly using the `TfidfVectorizer`.

    Initialize a `TfidfVectorizer` object with keyword argument `norm=None` saved to a variable `vectorizer`.

In [11]:
vectorizer = TfidfVectorizer(norm=None)

9. Fit and transform your `vectorizer` on `processed_articles` to calculate the tf-idf scores for each article in one step. Save the resulting tf-idf scores to a variable named `tfidf_scores`.

    After you save the tf-idf scores to `tfidf_scores`, you will see another DataFrame appear in the browser component. View the DataFrame to see the tf-idf scores for each article.

    Do the tf-idf scores given by `TfidfVectorizer` look the same as those given by `TfidfTransformer`? 
    
*YES*

In [18]:
tfidf_scores = vectorizer.fit_transform(processed_articles)
try:
    df_tf_idf = pd.DataFrame(tfidf_scores.T.todense(), index=feature_names, columns=article_index)
#     print(df_tf_idf)
except:
    pass
df_tf_idf

Unnamed: 0,Article 1,Article 2,Article 3,Article 4,Article 5,Article 6,Article 7,Article 8,Article 9,Article 10
abbasi,0.000000,0.000000,0.000000,2.704748,0.0,0.000000,0.000000,0.0,0.000000,0.000000
abide,2.704748,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000
about,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,2.704748,0.0,0.000000,0.000000
accord,0.000000,0.000000,2.704748,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000
add,2.299283,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,2.299283,0.000000
...,...,...,...,...,...,...,...,...,...,...
world,0.000000,0.000000,0.000000,0.000000,0.0,8.114244,0.000000,0.0,0.000000,0.000000
would,0.000000,0.000000,0.000000,2.299283,0.0,0.000000,0.000000,0.0,2.299283,0.000000
year,0.000000,2.704748,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000
yi,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,5.409496


10. Let’s confirm that the tf-idf scores given by `TfidfTransformer` and `TfidfVectorizer` are the same.

    Paste the following if statement under the comment “check if tf-idf scores are equal”:

    You should see that the tf-idf scores are, in fact, the same!

In [13]:
# get highest scoring tf-idf term for each article

if np.allclose(tfidf_scores_transformed.todense(), tfidf_scores.todense()):
    print(pd.DataFrame({'Are the tf-idf scores the same?':['YES']}))
else:
    print(pd.DataFrame({'Are the tf-idf scores the same?':['No, something is wrong :(']}))

  Are the tf-idf scores the same?
0                             YES


## Analyze the Results

11. A simple way of identifying the “topic” of a document is to label the document with its highest-scoring tf-idf term. While this is a more naive approach than others, it is a quick and easy way of getting insight into the topic of a document.

    Scroll down to the bottom, and begin by writing a for loop that iterates a variable `i` through the values `1` to `10`.

12. The Pandas Series method `.idxmax()` is a helpful tool for returning the index of the highest value in a DataFrame column. We will use this method to find the highest scoring tf-idf term for each article.

    Within the `for` loop, paste the following code:

        print(df_tf_idf[[f'Article {i}']].idxmax())

    On each pass through the `for` loop, this code will print the index of the term with the highest tf-idf score for that article (from Article 1 to Article 10).

    Compare the actual text of the articles to the selected term. Do printed terms give you any insight into the topic of the respective articles?

*Some are more transparent than others.*

In [15]:
for i in range(1, 9):
    print(df_tf_idf[[f'Article {i}']].idxmax())

Article 1    fare
dtype: object
Article 2    hong
dtype: object
Article 3    sugar
dtype: object
Article 4    petrol
dtype: object
Article 5    engine
dtype: object
Article 6    australia
dtype: object
Article 7    car
dtype: object
Article 8    railway
dtype: object
