<a href="https://colab.research.google.com/github/coezbek/uts-36118-anlp-2026/blob/main/Session_1_NLP_Basics_Part_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# tinyurl.com/ANLPcolab1part2


Go to "File" -> "Save a Copy in Drive..."
This creates your own copy of the notebook in your Google Drive, so any changes you make don't affect the shared notebook.

## Basic text analysis using Python

The first step is to install the required libraries using the pip command (if you don't have them), and import the modules from the libraries.



In [None]:
#Enable plots to be displayed in the notebook
%matplotlib inline

!pip install seaborn

import pandas as pd
import seaborn as sns
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## Mounting the drive

In this notebook, I'm mounting Google Drive to read a CSV file that is stored on my drive. You must allow access to your drive by signing in to your Google account.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

MessageError: Error: credential propagation was unsuccessful

Download the dataset from here: https://drive.google.com/file/d/1qAC9x-WMwzofyG8l1fFUoHwWhk3n61fp/view?usp=sharing

Then copy it to the Google Drive folder that contains the notebook.

In [None]:
# After executing the cell above, Drive files will be present in "/content/drive/My Drive". The below command lists the contents in the drive:
!ls "/content/drive/My Drive/Colab Notebooks/ANLP Colab Notebooks/ANLP Datasets"


## Reading Data from a CSV File

To read the data from the input CSV file on my Google Drive and store it as a pandas DataFrame, I use the read_csv() function from pandas. You have to change the folder location to wherever the file is stored in your own Google Drive. Mine is in this path:
/content/drive/My Drive/Colab Notebooks/ANLP Colab Notebooks/ANLP Datasets/Session1_CNN_Articles_2021-2023.csv

You can read about the different functions and their input parameters in the documentation for the library:
[Pandas Documentation](http://pandas.pydata.org/pandas-docs/stable/)

**Note:** Comment out the code below if you are not importing from your Google Drive folder



In [None]:
#The input csv file is a subset of the data from https://github.com/hadasu/CNN_web_crawler
#MAKE SURE YOU CHANGE THIS FOLDER TO POINT TO THE RIGHT DIRECTORY IN YOUR GDRIVE
newsdf = pd.read_csv('/content/drive/My Drive/Colab Notebooks/ANLP Colab Notebooks/ANLP Datasets/Session1_CNN_Articles_2021-2023.csv',encoding='unicode_escape')
newsdf.describe()
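
Because the Drive mount can fail (for example with a credential error), it can help to check that the CSV file is actually visible before calling read_csv(). The following is a minimal sketch using only the standard library; `first_existing_path` is a hypothetical helper written for this illustration, not part of pandas or Colab, and the path is the one used in this notebook.

```python
from pathlib import Path

def first_existing_path(candidates):
    """Return the first candidate path that exists as a file, or None."""
    for candidate in candidates:
        if Path(candidate).is_file():
            return candidate
    return None

# Path taken from this notebook; adjust it to your own Drive layout.
csv_candidates = [
    "/content/drive/My Drive/Colab Notebooks/ANLP Colab Notebooks/ANLP Datasets/Session1_CNN_Articles_2021-2023.csv",
]

csv_path = first_existing_path(csv_candidates)
if csv_path is None:
    print("CSV not found on Drive; check the mount and the folder path.")
else:
    print("Found dataset at:", csv_path)
```

If the file is not found, the smaller dataset hosted on GitHub is a reasonable fallback.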

## Reading input file from a url
The alternative option is to read in the CSV from a web url (on github) and store it in a dataframe. This is a smaller dataset containing articles only from 2021 January to March.


In [None]:
url = 'https://github.com/AntonetteShibani/NLPAnalysis/blob/main/CNN_Articles_2021.csv?raw=true'
newsdf = pd.read_csv(url)

## Preliminary data inspection

We usually try to get a sense of the data first (particularly useful for large data sets where opening in other UI-based tools is not easy)

In [None]:
#Print general information about a DataFrame including the index dtype and columns, non-null values and memory usage
newsdf.info()

In [None]:
newsdf.rename(columns={'Unnamed: 0': 'ID'}, inplace=True)

In [None]:
#Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution
newsdf.describe()

In [None]:
# Use the .head(n) function to look at the first 'n' rows of our news dataframe. The default n is 5, we are now changing it to view the first 10 rows
newsdf.head(10)


In [None]:
#A function similar to above, but provides a random sample of rows rather than the first few.
newsdf.sample(5)

## Word Count

Word counts are simple but useful indicators for asking questions on the length of texts.

To demonstrate usage, we see how the metrics are calculated for one sample sentence from the dataset.

In [None]:
s = newsdf['headline'][2]
print(s)

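#Splitting by whitespace characters and calculating the length. Note that standalone punctuation marks (e.g. a dash surrounded by spaces) are also counted as words, while punctuation attached to a word stays part of that word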
len(s.split())

In [None]:
#To make it easier to reuse in the future, we can create a function that returns word count
def word_count(text):
    wc = len(text.split())
    return wc
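
To see how whitespace splitting treats punctuation, here is a small sketch on a made-up sentence (not taken from the dataset). `word_count_alnum` is a hypothetical variant added only for comparison; it counts runs of letters, digits, and underscores using a regular expression.

```python
import re

def word_count(text):
    """Count whitespace-separated tokens (same logic as the notebook's function)."""
    return len(text.split())

def word_count_alnum(text):
    """Hypothetical variant: count only runs of letters, digits, or underscores."""
    return len(re.findall(r"\w+", text))

# Made-up example sentence with standalone punctuation
sample = "Markets rally - but for how long ?"

print(word_count(sample))        # the dash and question mark count as words
print(word_count_alnum(sample))  # punctuation-only tokens are ignored
```

Note that the regular expression variant has its own quirks: a contraction such as "don't" is split into two matches, so neither count is the "true" one.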

Now we can apply the word_count function to our text column to create a new column with the number of words in the news article text.

In [None]:
newsdf['article_word_count'] = newsdf['text'].apply(word_count)

We can use the describe, hist, and boxplot functions to provide some information on the length of articles in our dataset

In [None]:
newsdf['article_word_count'].describe()
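
To make the describe() output easier to read, the sketch below computes the same kinds of summary values by hand for five made-up word counts, using Python's statistics module. The numbers are invented for illustration. To my understanding, pandas reports the sample standard deviation and linearly interpolated quartiles by default, which is what `stdev` and the "inclusive" quantile method compute here.

```python
import statistics

# Made-up word counts for five articles (not taken from the dataset)
word_counts = [120, 450, 300, 800, 150]

mean_wc = statistics.mean(word_counts)
std_wc = statistics.stdev(word_counts)  # sample standard deviation (divides by n - 1)
q1, median_wc, q3 = statistics.quantiles(word_counts, n=4, method="inclusive")

print("count:", len(word_counts))
print("mean:", mean_wc)
print("std:", round(std_wc, 2))
print("min:", min(word_counts))
print("25%:", q1)
print("50%:", median_wc)
print("75%:", q3)
print("max:", max(word_counts))
```

A mean well above the median, as in this toy data, suggests a few very long articles are pulling the average up.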

In [None]:
newsdf['article_word_count'].hist(bins = 10)

In [None]:
sns.boxplot(x = "part_of",
            y = "article_word_count",
            palette='Set3',
            data =newsdf);

In [None]:
#I'm using a function that populates a bar graph from a dataframe column
import matplotlib.pyplot as plt
from collections import Counter
from nltk.corpus import stopwords

def wordBarGraphFunction(df,column,title):
    #Load the stopword list once as a set; calling stopwords.words() for every word is very slow
    stop_set = set(stopwords.words("english"))
    topic_words = [ z.lower() for y in
                       [ x.split() for x in df[column] if isinstance(x, str)]
                       for z in y]
    word_count_dict = dict(Counter(topic_words))
    popular_words = sorted(word_count_dict, key = word_count_dict.get, reverse = True)
    popular_words_nonstop = [w for w in popular_words if w not in stop_set]
    #Plot the 50 most frequent words (or fewer if the column has fewer distinct words)
    top_words = popular_words_nonstop[0:50]
    plt.barh(range(len(top_words)), [word_count_dict[w] for w in reversed(top_words)])
    #Bars are centred on their positions, so the tick labels go at the same positions
    plt.yticks(range(len(top_words)), list(reversed(top_words)))
    plt.title(title)
    plt.show()

In [None]:
plt.figure(figsize=(10,10))
wordBarGraphFunction(newsdf,'headline',"Most frequent words in news article headlines (Jan-Mar 2021)")

We can further explore the shortest and longest articles

In [None]:
#shortest
newsdf.sort_values(by='article_word_count').head(10)

In [None]:
#longest
newsdf.sort_values(by='article_word_count', ascending=False).head(10)

You can then examine the content of individual articles to gain additional insight as needed.

## Word frequencies

Computing word frequencies (counting how often words occur) is a critical step in quantifying texts for many kinds of text analysis. There are built-in functions in Python and its libraries that can compute word frequencies.

Note that this analysis disregards the word order in the original sentence, taking a bag-of-words approach.
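
As a quick illustration of what "bag-of-words" means, the sketch below counts the words of two made-up sentences with Python's collections.Counter. The sentences mean different things, but because only the counts are kept, their bags of words are identical.

```python
from collections import Counter

# Two made-up sentences with the same words in a different order
sentence_a = "the dog bit the man"
sentence_b = "the man bit the dog"

bag_a = Counter(sentence_a.split())
bag_b = Counter(sentence_b.split())

print(bag_a)
print(bag_b)
print("Same bag of words:", bag_a == bag_b)
```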


Calculate frequencies to determine the most common words in the corpus

In [None]:
# joining all article texts into one string (to_string() would also add the row index numbers to the text)
article_text = ' '.join(newsdf['text'].dropna().astype(str))

#create word tokens
tokenized_words=word_tokenize(article_text)

In [None]:
all_words=nltk.FreqDist(tokenized_words)
all_words.plot(10);
print(all_words.most_common(20))
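
nltk.FreqDist builds on Python's collections.Counter, so most_common() can be understood with a plain Counter. The sketch below uses a short, made-up token list (not the dataset) and also computes a relative frequency by hand.

```python
from collections import Counter

# Made-up tokens, including punctuation as the tokenizer would produce it
tokens = ["the", "vaccine", ",", "the", "rollout", ",", "the", "plan", "."]

freq = Counter(tokens)

# The two most frequent tokens with their counts
top_two = freq.most_common(2)
print(top_two)

# Relative frequency: the share of all tokens that are "the"
share_of_the = freq["the"] / sum(freq.values())
print(round(share_of_the, 3))
```

Even in this tiny example, a stopword and a punctuation mark dominate the counts, which is the same pattern that appears in the full corpus.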

## Create a word cloud to show most common words

Note: There are many ways to customise word clouds for display. Check out the documentation and read related blog posts to try different combinations. Here, I'm using the wordcloud package to create a word cloud from the given article text.

In [None]:
from wordcloud import WordCloud
wordcloud = WordCloud(max_words=100).generate(article_text)

import matplotlib.pyplot as plt
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

You will notice that the most frequent terms are stopwords and punctuation marks. Let's try recalculating the frequencies after performing some basic cleaning.

In [None]:
# converting article text to lowercase because string comparison in Python is case-sensitive ("The" and "the" would be counted separately)
article_text_lower = article_text.lower()

#create word tokens
tokenized_words=word_tokenize(article_text_lower)

#Set up stop words for removal
nltk.download('stopwords')
from nltk.corpus import stopwords
#stopwords
stop_words=stopwords.words("english")
print(stop_words)
#Add custom stopwords to the list
stop_words.extend(["cnn", "'s", "a", "the"])

In [None]:
#Create a new variable to store filtered tokens
filtered_tokens=[]
for w in tokenized_words:
    if w not in stop_words:
        #add all filtered tokens excluding stopwords in this list below
        filtered_tokens.append(w)

import string
# punctuations
punctuations=list(string.punctuation)

#Add custom punctuation tokens to the list; find the relevant ones by running the pre-processing steps on the data set and inspecting the output
punctuations.append("...")
punctuations.append("``")
punctuations.append("''")

print("List of punctuations to remove:\n")
print(punctuations)

#Create another variable to store all clean tokens
filtered_tokens_clean=[]
for i in filtered_tokens:
    if i not in punctuations:
        filtered_tokens_clean.append(i)
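
The two filtering loops above can also be written as a single list comprehension. The sketch below is one possible compact version, using a small hand-picked stopword set and a made-up token list as assumptions (the notebook itself uses NLTK's English stopword list). Storing the stopwords and punctuation in sets makes each membership check fast, which matters for large corpora.

```python
import string

# Small hand-picked stopword set for illustration only (assumption)
demo_stop_words = {"the", "a", "is", "in", "cnn", "'s"}

# Single punctuation characters plus the custom tokens added in the notebook
demo_punctuation = set(string.punctuation) | {"...", "``", "''"}

# Made-up lowercase tokens, as they might come out of the tokenizer
demo_tokens = ["the", "vaccine", "is", "here", ",", "cnn", "reports", "..."]

# Keep a token only if it is neither a stopword nor punctuation
clean_tokens = [
    t for t in demo_tokens
    if t not in demo_stop_words and t not in demo_punctuation
]
print(clean_tokens)
```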

Now that we have cleaned the input text, let's calculate frequencies again to view the most common words.

In [None]:
all_words=nltk.FreqDist(filtered_tokens_clean)
all_words.plot(10);
print(all_words.most_common(20))

Let's generate the word cloud again with the cleaned set of words. Here, I'm creating the word cloud from the word frequencies we calculated from the last step (rather than passing the entire text).

In [None]:
# Convert the FreqDist (a mapping of word -> count) to a plain dictionary
word_freq = dict(all_words)

# Generate word cloud
wordcloud = WordCloud(width=800, height=400, background_color='black').generate_from_frequencies(word_freq)

# Plot the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

Exercise: What are the insights from here? What do the key words indicate?