## Assignment 3 ##

Your Name: Amaris Efthimou

## Assignment Question ##

Using the data set below, perform LDA topic modeling to identify multiple latent topics in it.

The number of topics is not fixed; it is up to you to decide how many topics to use.

## Grading Guidelines: ##

You need to show all the steps (code and outputs), from loading the data set through performing topic modeling, to derive topics with keywords.

DO NOT CLEAR THE OUTPUTS (Leave the outputs printed).


## Step 1: Load the dataset

The dataset we'll use is a list of news headlines published over a period of 15 years. 

We'll start by loading it from the `abcnews-date-text.csv` file.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# Connect Colab to your Google Drive.
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [3]:
news = pd.read_csv('/content/gdrive/My Drive/abcnews-date-text.csv')

In [4]:
news.head()

Unnamed: 0,publish_date,headline_text
0,20030219,aba decides against community broadcasting lic...
1,20030219,act fire witnesses must be aware of defamation
2,20030219,a g calls for infrastructure protection summit
3,20030219,air nz staff in aust strike for pay rise
4,20030219,air nz strike to affect australian travellers


## Step 2: Data Preprocessing ##

We will perform the preprocessing steps listed below.

The steps do not have to be performed in this order.

You may tokenize first, start with another step, or apply several steps at once.

HOWEVER, make sure that all of the steps below are applied to the headline text.

* **Tokenization**
* **Lowercasing**
* **Punctuation removal**
* **Short-word removal** - words that have fewer than 3 characters are removed.
* **Stopword removal**
* **Lemmatization** - words in third person are changed to first person, and verbs in past and future tenses are changed into present tense.
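
As a rough illustration of how the first five steps fit together, here is a minimal standard-library sketch. The names `toy_preprocess` and `TOY_STOPWORDS` are hypothetical, the stopword list is a tiny made-up sample rather than a real one, and lemmatization is left out because it needs the NLTK code used in the notebook.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only;
# the notebook itself uses NLTK's English stopword list.
TOY_STOPWORDS = {"for", "the", "and", "against", "must"}

def toy_preprocess(text, stopwords=TOY_STOPWORDS, min_len=3):
    """Tokenize, lowercase, strip punctuation, and drop short words and stopwords."""
    # Lowercase, then keep only runs of letters: this tokenizes and
    # removes punctuation and digits in one step.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= min_len and t not in stopwords]

print(toy_preprocess("A G calls for Infrastructure Protection summit!"))
# ['calls', 'infrastructure', 'protection', 'summit']
```

Note that `calls` stays in its inflected form here; turning it into `call` is the job of the lemmatization step.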

**Lemmatization code is given below; use it for lemmatization.**

In [5]:
import nltk
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer
import numpy as np
import string  # avoid aliasing as `str`, which would shadow the built-in type
import gensim
from gensim.utils import simple_preprocess
np.random.seed(400)

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [6]:
print(WordNetLemmatizer().lemmatize('did', pos = 'v'))

do


In [7]:
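from nltk.corpus import stopwords

lem = WordNetLemmatizer()
sw = set(stopwords.words('english'))

def pp(text):
    """Tokenize a headline, drop stopwords and short words, and lemmatize verbs."""
    ans = []
    # simple_preprocess lowercases, tokenizes, and strips punctuation
    for token in simple_preprocess(text):
        # Keep tokens of 3+ characters that are not stopwords
        if token not in sw and len(token) > 2:
            ans.append(lem.lemmatize(token, pos='v'))
    return ans

news['headline_text2'] = news['headline_text'].apply(pp)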

In [8]:
news.head()

Unnamed: 0,publish_date,headline_text,headline_text2
0,20030219,aba decides against community broadcasting lic...,"[aba, decide, community, broadcast, licence]"
1,20030219,act fire witnesses must be aware of defamation,"[act, fire, witness, must, aware, defamation]"
2,20030219,a g calls for infrastructure protection summit,"[call, infrastructure, protection, summit]"
3,20030219,air nz staff in aust strike for pay rise,"[air, staff, aust, strike, pay, rise]"
4,20030219,air nz strike to affect australian travellers,"[air, strike, affect, australian, travellers]"


## Step 3: Bag of words on the dataset

* 3-1. Dictionary

Create a dictionary from the pre-processed headline texts that maps each unique word to an integer id and tracks how many times the word appears in the training set.

To do that, pass the pre-processed headline texts to [`gensim.corpora.Dictionary()`](https://radimrehurek.com/gensim/corpora/dictionary.html) and call the result `dictionary`.

In [9]:
# Creates an empty dictionary; the next cell rebuilds it from the tokenized headlines.
dictionary = gensim.corpora.Dictionary()

In [10]:
from gensim.corpora import Dictionary
headlines = news['headline_text2'].tolist()
dictionary = Dictionary(headlines)

* 3-2. Gensim filter_extremes

[`filter_extremes(no_below=i, no_above=j, keep_n=k)`](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.filter_extremes), where `i` and `k` are integers and `j` is a fraction between 0 and 1.

Filter out tokens that appear in

* fewer than `no_below` documents (absolute number) or
* more than `no_above` documents (fraction of total corpus size, not absolute number).

After these two filters, keep only the `keep_n` most frequent tokens (or keep all if `keep_n` is `None`).
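
To make the three rules concrete, here is a minimal standard-library sketch of the filtering logic as described above. It is not gensim's implementation: `toy_filter_extremes` is a hypothetical helper, the four documents are made up, and the alphabetical tie-breaking is an assumption added only to keep the result deterministic.

```python
from collections import Counter

def toy_filter_extremes(docs, no_below=5, no_above=0.5, keep_n=None):
    """Return the set of tokens that survive the three filtering rules."""
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)  # convert the fraction to a document count

    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Most frequent first; ties broken alphabetically (an assumption of this sketch).
    kept.sort(key=lambda t: (-doc_freq[t], t))
    if keep_n is not None:
        kept = kept[:keep_n]
    return set(kept)

# Made-up tokenized documents for illustration.
docs = [
    ["police", "charge", "man"],
    ["police", "court", "man"],
    ["police", "rain"],
    ["police", "court"],
]
print(sorted(toy_filter_extremes(docs, no_below=2, no_above=0.75)))
# ['court', 'man']
```

Here `police` is dropped for appearing in all four documents (more than 75%), while `charge` and `rain` are dropped for appearing in only one.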

In [11]:
i = 5  # no_below
j = 0.5  # no_above
k = 10000  # keep_n
dictionary.filter_extremes(no_below=i, no_above=j, keep_n=k)

* 3-3. Gensim doc2bow

* Gensim doc2bow: pass the tokenized words to `doc2bow` to convert them to vectors.

* Caution: No further preprocessing (tokenization, lemmatization, etc.) should be done before this step.
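
The bag-of-words idea can be sketched without gensim as follows. The helpers `build_token2id` and `toy_doc2bow` are hypothetical, and assigning ids in order of first appearance is an assumption of this sketch, so the ids will generally differ from those gensim assigns.

```python
from collections import Counter

def build_token2id(docs):
    """Assign each unique token an integer id in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def toy_doc2bow(doc, token2id):
    """Convert a token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Made-up tokenized documents for illustration.
docs = [["air", "staff", "strike", "pay"], ["air", "strike", "affect", "strike"]]
token2id = build_token2id(docs)
print(toy_doc2bow(docs[1], token2id))
# [(0, 1), (2, 2), (4, 1)]
```

Each pair reads as "token id, number of occurrences in this document"; word order is discarded, which is what makes it a bag of words.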

In [12]:
#code for step 3-3
corpus = [dictionary.doc2bow(headline) for headline in headlines]

## Step 4: Running LDA using Bag of Words ##

Train an LDA model on your final corpus.


In [13]:
#Run LDA model on the final corpus.

#num_topics: the number of latent topics to be extracted from the corpus.
#id2word: mapping from word ids (integers) to words (strings).
# Other parameters are available. See the documentation for more details.

#code for step 4.

In [14]:
#you may use multiple code blocks.

In [17]:
from gensim.models import LdaModel
num_topics = 10  # Number of topics to be extracted
passes = 5  # Number of passes through the corpus during training
iterations = 10  # Maximum number of iterations when inferring topic distributions
random_state = 42  # Random seed for reproducibility
lda_model = LdaModel(
    corpus=corpus,
    num_topics=num_topics,
    id2word=dictionary,
    passes=passes,
    iterations=iterations,
    random_state=random_state,
)

# Print the top 10 words in each topic
for topic_id in range(num_topics):
    top_words = [word for word, _ in lda_model.show_topic(topic_id, topn=10)]
    print(f"Topic {topic_id}: {' '.join(top_words)}")

Topic 0: day north one south use australia final war korea victoria
Topic 1: canberra coast miss mine service west job bank gold cut
Topic 2: plan show make change help say health indigenous power new
Topic 3: australian fire attack house school year state ban tasmania trial
Topic 4: queensland perth years jail china arrest life fight deal new
Topic 5: police man trump sydney charge melbourne kill die crash woman
Topic 6: court face get accuse child time interview say tell lose
Topic 7: government election world home report tasmanian cup set pay bill
Topic 8: australia donald first say test turnbull labor leave women country
Topic 9: adelaide open market afl break league win share national concern


## Step 5: Label the topics ##

Using the keywords in each topic, what topics were you able to infer?
Write down the inferred topic labels below.


In [19]:
#Topic 0: Geographical locations & war
#Topic 1: Economic & government issues
#Topic 2: Political change & health
#Topic 3: Natural disasters & legal issues
#Topic 4: Legal issues & foreign affairs
#Topic 5: Crime & law enforcement
#Topic 6: Legal issues & court
#Topic 7: Government politics
#Topic 8: Social issues in government
#Topic 9: Sports news
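
One possible way to sanity-check a label is to look at which topic dominates individual headlines and see whether those headlines fit the label. The sketch below is standard-library only: `dominant_topic` is a hypothetical helper and the probabilities are made up. In the notebook, a list of `(topic_id, probability)` pairs like this could come from `lda_model.get_document_topics(corpus[i])`.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])

# Made-up topic distribution for a single headline, for illustration only.
example = [(2, 0.10), (5, 0.72), (6, 0.18)]
topic_id, prob = dominant_topic(example)
print(f"Dominant topic: {topic_id} (p={prob:.2f})")
# Dominant topic: 5 (p=0.72)
```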