## ISBI Laboration 1: Introduction to Word2Vec with Python

Inspired by ChatGPT, corrected, updated and extended by Hercules Dalianis, Nov 2023

Help with dataframes and visualisation by Laleh Davoodi.

### Objective:

To understand the concept of Word2Vec.
To learn how to train a Word2Vec model using Python.
To explore the vector representations of words.

### Data:
Reuters news corpus as distributed with NLTK (Reuters-21578, news articles from 1987)

### Prerequisites:
Jupyter Notebook. If it is not installed, you can use Google Colab instead.

### Setup:
If you haven't already, install Jupyter Notebook or open Google Colab.

### Introduction:
Word2Vec is a popular technique for word embeddings, which represent words as dense vectors in a continuous space. These vectors capture semantic relationships between words, allowing us to perform various NLP tasks. In this lab, we will learn the basics of Word2Vec.
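
To make "dense vectors that capture semantic relationships" concrete, here is a minimal sketch using only the Python standard library. The three-dimensional vectors are invented for illustration and do not come from any trained model; the helper name `cosine_similarity` is likewise our own. Real Word2Vec vectors in this lab have 100 dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings", invented for illustration only.
toy_vectors = {
    "economy": [0.9, 0.1, 0.3],
    "economic": [0.8, 0.2, 0.4],
    "banana": [-0.2, 0.9, -0.5],
}

for word in ("economic", "banana"):
    score = cosine_similarity(toy_vectors["economy"], toy_vectors[word])
    print(f"economy vs {word}: {score:.3f}")
```

Vectors that point in nearly the same direction get a score close to 1, while unrelated directions give scores near 0 or below.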

Start Jupyter Notebook or Google Colab. In Google Colab you must upload this file.

In [1]:
# Import necessary libraries
from gensim.models import Word2Vec
import nltk
from nltk.corpus import reuters
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\David\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Exercise 1: Loading and Preprocessing Text Data

In [2]:
# Download and install new corpus Reuters
nltk.download('reuters')
# Note: the Reuters corpus shipped with NLTK (Reuters-21578) contains news from 1987

[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\David\AppData\Roaming\nltk_data...
[nltk_data]   Package reuters is already up-to-date!


True

In [3]:
# Count words of corpus using the internal NLTK format.
words = nltk.corpus.reuters.words()

# Calculate length of words list and print it
len(words)

1720901

In [4]:
# Tokenize the Reuters corpus. 
# Create lists of tokenised sentences, which is the input form of word2vec 
sentences = reuters.sents()

len(sentences)

54716

In [5]:
# Print the first three sentences; list indexing starts at 0.
print(sentences[0])
print(sentences[1])
print(sentences[2])

['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', '.', 'S', '.-', 'JAPAN', 'RIFT', 'Mounting', 'trade', 'friction', 'between', 'the', 'U', '.', 'S', '.', 'And', 'Japan', 'has', 'raised', 'fears', 'among', 'many', 'of', 'Asia', "'", 's', 'exporting', 'nations', 'that', 'the', 'row', 'could', 'inflict', 'far', '-', 'reaching', 'economic', 'damage', ',', 'businessmen', 'and', 'officials', 'said', '.']
['They', 'told', 'Reuter', 'correspondents', 'in', 'Asian', 'capitals', 'a', 'U', '.', 'S', '.', 'Move', 'against', 'Japan', 'might', 'boost', 'protectionist', 'sentiment', 'in', 'the', 'U', '.', 'S', '.', 'And', 'lead', 'to', 'curbs', 'on', 'American', 'imports', 'of', 'their', 'products', '.']
['But', 'some', 'exporters', 'said', 'that', 'while', 'the', 'conflict', 'would', 'hurt', 'them', 'in', 'the', 'long', '-', 'run', ',', 'in', 'the', 'short', '-', 'term', 'Tokyo', "'", 's', 'loss', 'might', 'be', 'their', 'gain', '.']


#### Question 1: What is the purpose of tokenizing sentences into words?

Exercise 2: Training a Word2Vec Model

In [6]:
# Train a Word2Vec model with Reuters corpus
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)

Now we have built a word2vec model of the Reuters corpus.
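
As a rough illustration of what a context window is, the sketch below lists the (target, context) word pairs that fall within a given distance of each other in one tokenised sentence. The function `context_pairs` is a hypothetical helper written for this lab text, not part of gensim, and gensim's actual training adds refinements that are not shown here. With sg=0 (CBOW), the context words around a position are combined to predict the target word.

```python
def context_pairs(tokens, window):
    """Return (target, context) pairs within `window` positions of each token."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

# The beginning of the second Reuters sentence printed earlier.
sentence = ['They', 'told', 'Reuter', 'correspondents', 'in', 'Asian', 'capitals']
for target, context in context_pairs(sentence, window=2)[:6]:
    print(target, "->", context)
```

Try changing `window` and see how the number of pairs grows.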

In [7]:
# Find vector for a word
vector = model.wv['economy']

# Print the vector for the word
print(vector)

[ 1.1274002e+00  3.3547398e-01 -2.6488962e+00  1.0882885e+00
 -2.2852233e-01  2.1810293e+00  1.5596350e+00  5.3422970e-01
  4.8847660e-01 -3.3951962e-01  5.9733278e-01  5.0753248e-01
  2.7595784e-03  4.0171045e-01 -8.5446984e-01 -6.3429856e-01
  6.7104977e-01  3.1843475e-01 -3.4750912e-01  4.6754843e-01
 -6.4655066e-01 -2.0288999e+00 -1.2504990e-01 -1.4670138e+00
 -3.2082546e-01 -3.1014001e-01  1.5543903e+00  4.4491401e-01
 -5.1107454e-01  4.0817320e-01 -2.7206286e-03 -2.8651816e-01
  3.6445722e-01 -5.1649845e-01 -6.0204089e-01  1.0714074e+00
 -1.2225339e+00 -1.3258578e-01 -7.3834115e-01 -4.2011398e-01
  8.3323622e-01  1.0994993e+00  1.7861873e-01 -1.4300352e+00
  2.7773390e+00  4.6924430e-01 -2.4053396e-01 -5.6741583e-01
  1.8283899e-01 -8.3035010e-01  1.5184389e+00  1.5438917e-01
 -3.2820508e-01 -2.4104278e-01 -6.2988788e-01  3.0780223e-01
 -9.4283089e-02  5.6192529e-01  7.0515126e-01 -1.5912230e-01
  1.5541530e+00 -1.4448928e+00  4.8916999e-01 -8.9810944e-01
 -5.1414216e-01 -1.44652

#### Question 2: What do the numbers above mean? How are they related to "vector_size"? What does the parameter "vector_size" do when training a Word2Vec model? What does "window" represent?

In [8]:
# Find words most similar to a given word
# Just replace the word 'economy' with another word and rerun the cell.
similar_words = model.wv.most_similar('economy')

print(similar_words)

[('economic', 0.8603454232215881), ('strength', 0.8466590046882629), ('situation', 0.8440170288085938), ('policies', 0.8293824791908264), ('policy', 0.8153066039085388), ('monetary', 0.8127627968788147), ('political', 0.8065685629844666), ('aggregates', 0.8055894374847412), ('country', 0.7938932776451111), ('growing', 0.7895318865776062)]


#### Question 3: How semantically close to 'economy' are the words that word2vec returns as similar?
Try some other words. How well does word2vec work for them?

#### Question 4: What does the number to the right of each word mean? If the number is "1", what does that mean? (Clue: think about the angle between vectors.)

Exercise 3: Exploring Word Similarity


In [9]:
# Find words most similar to a given word
# Some more words to test: just replace 'economy' with one of them and rerun the cell.
# Europe, oil, export, economy, business, agricultural, Russia, China, president, Queen, unemployment, computer
similar_words = model.wv.most_similar('economy')

print(similar_words)

[('economic', 0.8603454232215881), ('strength', 0.8466590046882629), ('situation', 0.8440170288085938), ('policies', 0.8293824791908264), ('policy', 0.8153066039085388), ('monetary', 0.8127627968788147), ('political', 0.8065685629844666), ('aggregates', 0.8055894374847412), ('country', 0.7938932776451111), ('growing', 0.7895318865776062)]


#### Question 5: What do you see in the example above of words similar to 'economy'? Are there any strange similarities? Please discuss.

Other impressions?

Exercise 4: Exploring the preprocessing of the input data.
1) Remove noise from the corpus, such as punctuation

2) Convert the text to lower case

3) Remove stop words

4) Lemmatise the text

Here we use a pandas dataframe for faster preprocessing. We tried plain list processing, but with roughly 55,000 sentences in total it was too slow and eventually crashed.
The dataframe has 10,788 rows, one per news document (so more than one sentence per row).
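
Before applying the steps to the whole dataframe, it can help to see them on a single string. The sketch below uses only the standard library; `preprocess` and `TOY_STOP_WORDS` are hypothetical names, and the tiny stop-word set is invented for illustration (the lab itself uses NLTK's English list). Lemmatisation is left out here because it needs the WordNet data.

```python
import re

# Tiny illustrative stop-word set; the lab itself uses NLTK's English list.
TOY_STOP_WORDS = {"the", "of", "to", "in", "and", "a", "is", "its", "by"}

def preprocess(text):
    """Strip non-letters, lower-case, and drop stop words; returns a token list."""
    tokens = [re.sub("[^A-Za-z]+", "", token) for token in text.split()]
    tokens = [token.lower() for token in tokens if token]
    return [token for token in tokens if token not in TOY_STOP_WORDS]

sample = "MITI is expected to lower the projection to 550 mln kilolitres ( kl ) from 600 mln ."
print(preprocess(sample))
```

Note that the numbers disappear together with the punctuation, because the regular expression keeps letters only.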

In [10]:
# Getting file ids
fileids = reuters.fileids()

In [11]:
# Import pandas library
import pandas as pd
# Load data into pandas dataframe
all_reuters_words = []
for file_id in fileids:
    file_words = reuters.words(file_id)
    output = " ".join(file_words)
    all_reuters_words.append(output)
data = {"sentence": all_reuters_words}
df = pd.DataFrame.from_dict(data)
# make an extra duplicate column
df['original_sentence'] = df['sentence']


In [12]:
df.shape

(10788, 2)

In [13]:
# Show the first 10 rows of the data set
df.head(10)

Unnamed: 0,sentence,original_sentence
0,ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPA...,ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPA...
1,CHINA DAILY SAYS VERMIN EAT 7 - 12 PCT GRAIN S...,CHINA DAILY SAYS VERMIN EAT 7 - 12 PCT GRAIN S...
2,JAPAN TO REVISE LONG - TERM ENERGY DEMAND DOWN...,JAPAN TO REVISE LONG - TERM ENERGY DEMAND DOWN...
3,THAI TRADE DEFICIT WIDENS IN FIRST QUARTER Tha...,THAI TRADE DEFICIT WIDENS IN FIRST QUARTER Tha...
4,INDONESIA SEES CPO PRICE RISING SHARPLY Indone...,INDONESIA SEES CPO PRICE RISING SHARPLY Indone...
5,AUSTRALIAN FOREIGN SHIP BAN ENDS BUT NSW PORTS...,AUSTRALIAN FOREIGN SHIP BAN ENDS BUT NSW PORTS...
6,INDONESIAN COMMODITY EXCHANGE MAY EXPAND The I...,INDONESIAN COMMODITY EXCHANGE MAY EXPAND The I...
7,SRI LANKA GETS USDA APPROVAL FOR WHEAT PRICE F...,SRI LANKA GETS USDA APPROVAL FOR WHEAT PRICE F...
8,WESTERN MINING TO OPEN NEW GOLD MINE IN AUSTRA...,WESTERN MINING TO OPEN NEW GOLD MINE IN AUSTRA...
9,SUMITOMO BANK AIMS AT QUICK RECOVERY FROM MERG...,SUMITOMO BANK AIMS AT QUICK RECOVERY FROM MERG...


In [14]:
# Displaying the 3rd row of the 'sentence' column of the dataframe
print(df.loc[2, 'sentence'])

JAPAN TO REVISE LONG - TERM ENERGY DEMAND DOWNWARDS The Ministry of International Trade and Industry ( MITI ) will revise its long - term energy supply / demand outlook by August to meet a forecast downtrend in Japanese energy demand , ministry officials said . MITI is expected to lower the projection for primary energy supplies in the year 2000 to 550 mln kilolitres ( kl ) from 600 mln , they said . The decision follows the emergence of structural changes in Japanese industry following the rise in the value of the yen and a decline in domestic electric power demand . MITI is planning to work out a revised energy supply / demand outlook through deliberations of committee meetings of the Agency of Natural Resources and Energy , the officials said . They said MITI will also review the breakdown of energy supply sources , including oil , nuclear , coal and natural gas . Nuclear energy provided the bulk of Japan ' s electric power in the fiscal year ended March 31 , supplying an estimated 

In [15]:
# Remove all noise, i.e. every character that is not a letter (punctuation such as
# .,;() and, in this case, digits as well).
# library for regular expressions
import re
df['sentence'] = df['sentence'].apply(
    lambda text: " ".join(re.sub("[^A-Za-z]+", "", token) for token in nltk.word_tokenize(text)))

In [16]:
# Displaying the 3rd row after removing noise
print(df.loc[2, 'sentence'])

JAPAN TO REVISE LONG  TERM ENERGY DEMAND DOWNWARDS The Ministry of International Trade and Industry  MITI  will revise its long  term energy supply  demand outlook by August to meet a forecast downtrend in Japanese energy demand  ministry officials said  MITI is expected to lower the projection for primary energy supplies in the year  to  mln kilolitres  kl  from  mln  they said  The decision follows the emergence of structural changes in Japanese industry following the rise in the value of the yen and a decline in domestic electric power demand  MITI is planning to work out a revised energy supply  demand outlook through deliberations of committee meetings of the Agency of Natural Resources and Energy  the officials said  They said MITI will also review the breakdown of energy supply sources  including oil  nuclear  coal and natural gas  Nuclear energy provided the bulk of Japan  s electric power in the fiscal year ended March   supplying an estimated  pct on a kilowatt  hour basis  f

In [17]:
# Convert all data into lower case
df['sentence'] = df['sentence'].apply(lambda x: " ".join(x.lower() for x in str(x).split()))


In [18]:
# Displaying the 3rd row after converting to lower case
print(df.loc[2, 'sentence'])

japan to revise long term energy demand downwards the ministry of international trade and industry miti will revise its long term energy supply demand outlook by august to meet a forecast downtrend in japanese energy demand ministry officials said miti is expected to lower the projection for primary energy supplies in the year to mln kilolitres kl from mln they said the decision follows the emergence of structural changes in japanese industry following the rise in the value of the yen and a decline in domestic electric power demand miti is planning to work out a revised energy supply demand outlook through deliberations of committee meetings of the agency of natural resources and energy the officials said they said miti will also review the breakdown of energy supply sources including oil nuclear coal and natural gas nuclear energy provided the bulk of japan s electric power in the fiscal year ended march supplying an estimated pct on a kilowatt hour basis followed by oil pct and lique

In [19]:
# Remove the stop words
from nltk.corpus import stopwords
nltk.download('stopwords')
stop = stopwords.words('english')
print(stop)
df['sentence'] = df['sentence'].apply(lambda x: " ".join([x for x in x.split() if x not in stop]))


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\David\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [20]:
# Displaying the 3rd row after removing stop words
print(df.loc[2, 'sentence'])

japan revise long term energy demand downwards ministry international trade industry miti revise long term energy supply demand outlook august meet forecast downtrend japanese energy demand ministry officials said miti expected lower projection primary energy supplies year mln kilolitres kl mln said decision follows emergence structural changes japanese industry following rise value yen decline domestic electric power demand miti planning work revised energy supply demand outlook deliberations committee meetings agency natural resources energy officials said said miti also review breakdown energy supply sources including oil nuclear coal natural gas nuclear energy provided bulk japan electric power fiscal year ended march supplying estimated pct kilowatt hour basis followed oil pct liquefied natural gas pct noted


### Question 6. Looking at the text, what preprocessing steps can you see have been carried out?

Exercise 5: Lemmatisation of words.

In [21]:
# Define the lemmatization function.
from nltk import pos_tag
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')
    
lemmatizer = WordNetLemmatizer()

# penn2morphy function is a utility for converting specific Penn Treebank part-of-speech
# tags into their WordNet equivalents, and defaulting to 'noun' if the tag is not found 
# in its conversion dictionary.
# the lemmatiser needs to get instructions of which word classes to lemmatise, 
# here we have chosen noun, NN, verb, VB, adjective, JJ, and adverb, RB

def penn2morphy(penntag):
    morphy_tag = {'NN':'n', 'JJ':'a',
                 'VB':'v', 'RB':'r'}
    # Look up the first two characters of the tag; fall back to noun if unknown
    return morphy_tag.get(penntag[:2], 'n')


# lemmatizer function
def lemmatize_sent(text):
    # Text input is string.
    return [lemmatizer.lemmatize(word, pos=penn2morphy(tag))
            for word, tag in pos_tag(nltk.word_tokenize(text))]



[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\David\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\David\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
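
The tag mapping can be checked in isolation, without downloading any NLTK data. The block below is a standalone copy of the penn2morphy idea (written here with a dictionary lookup and a default) applied to a few standard Penn Treebank tags. It is only a sketch of the mapping and does not run the tagger or the lemmatiser.

```python
def penn2morphy(penntag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    morphy_tag = {'NN': 'n', 'JJ': 'a', 'VB': 'v', 'RB': 'r'}
    return morphy_tag.get(penntag[:2], 'n')

# Plural noun, past-tense verb, comparative adjective, superlative adverb, determiner
for tag in ['NNS', 'VBD', 'JJR', 'RBS', 'DT']:
    print(tag, '->', penn2morphy(tag))
```

Only the first two characters of the tag matter, so all noun tags (NN, NNS, NNP, ...) map to 'n', and any tag outside the four word classes is treated as a noun.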


In [22]:
# Example to test the lemmatiser
print(lemmatize_sent('Mazda is a better car than Toyota. He loves being here in these lovely cities. This is not economically sustainable. He has different viewpoints'))

# Write an interesting sentence and test the lemmatiser; not too short, not too long
print(lemmatize_sent('The quick brown foxes are jumping over the lazy dogs. Meanwhile, a wise owl watches from a nearby branch. The forest is alive with the sounds of rustling leaves and chirping crickets as the sun sets in the distance.'))   

['Mazda', 'be', 'a', 'good', 'car', 'than', 'Toyota', '.', 'He', 'love', 'be', 'here', 'in', 'these', 'lovely', 'city', '.', 'This', 'be', 'not', 'economically', 'sustainable', '.', 'He', 'have', 'different', 'viewpoint']
['The', 'quick', 'brown', 'fox', 'be', 'jump', 'over', 'the', 'lazy', 'dog', '.', 'Meanwhile', ',', 'a', 'wise', 'owl', 'watch', 'from', 'a', 'nearby', 'branch', '.', 'The', 'forest', 'be', 'alive', 'with', 'the', 'sound', 'of', 'rustle', 'leaf', 'and', 'chirp', 'cricket', 'a', 'the', 'sun', 'set', 'in', 'the', 'distance', '.']


### Question 6. Write your own sentence to try out the lemmatizer. Does it work properly?

In [23]:
# Perform lemmatization on all sentences
df['sentence'] = df['sentence'].apply(lambda x: " ".join(lemmatize_sent(x)))

In [24]:
# Install IPython with pip (it is normally already available in Jupyter and Colab)
!pip3 install --user ipython



In [25]:
# Function to pretty print the sentences.
import IPython
from IPython.display import display, HTML

# Displaying the first two rows after lemmatizing, next to the original text
display(HTML(df.loc[:1, ['sentence', 'original_sentence']].to_html()))

Unnamed: 0,sentence,original_sentence
0,asian exporter fear damage u japan rift mount trade friction u japan raise fear among many asia export nation row could inflict far reach economic damage businessmen official say told reuter correspondent asian capital u move japan might boost protectionist sentiment u lead curb american import product exporter say conflict would hurt long run short term tokyo loss might gain u say impose mln dlrs tariff import japanese electronics good april retaliation japan allege failure stick pact sell semiconductor world market cost unofficial japanese estimate put impact tariff billion dlrs spokesman major electronics firm say would virtually halt export product hit new tax able business say spokesman lead japanese electronics firm matsushita electric industrial co ltd lt mc tariff remain place length time beyond month mean complete erosion export good subject tariff u say tom murtha stock analyst tokyo office broker lt james capel co taiwan businessmen official also worry aware seriousness u threat japan serve warn u say senior taiwanese trade official ask name taiwan trade trade surplus billion dlrs last year pct u surplus help swell taiwan foreign exchange reserve billion dlrs among world large must quickly open market remove trade barrier cut import tariff allow import u product want defuse problem possible u retaliation say paul sheen chairman textile exporter lt taiwan safe group senior official south korea trade promotion association say trade dispute u japan might also lead pressure south korea whose chief export similar japan last year south korea trade surplus billion dlrs u billion dlrs malaysia trade officer businessmen say tough curb japan might allow hard hit producer semiconductor third country expand sale u hong kong newspaper allege japan selling cost semiconductor electronics manufacturer share view businessmen say short term commercial advantage would outweigh u pressure block import short term view say lawrence mill director general federation hong kong 
industry whole purpose prevent import one day extend source much serious hong kong disadvantage action restrain trade say u last year hong kong big export market account pct domestically produce export australian government await outcome trade talk u japan interest concern industry minister john button say canberra last friday kind deterioration trade relation two country major trade partner serious matter button say say australia concern centre coal beef australia two large export japan also significant u export country meanwhile u japanese diplomatic manoeuvre solve trade stand continue japan ruling liberal democratic party yesterday outline package economic measure boost japanese economy measure propose include large supplementary budget record public work spend first half financial year also call step spending emergency measure stimulate economy despite prime minister yasuhiro nakasone avow fiscal reform program deputy u trade representative michael smith makoto kuroda japan deputy minister international trade industry miti due meet washington week effort end dispute,"ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPAN RIFT Mounting trade friction between the U . S . And Japan has raised fears among many of Asia ' s exporting nations that the row could inflict far - reaching economic damage , businessmen and officials said . They told Reuter correspondents in Asian capitals a U . S . Move against Japan might boost protectionist sentiment in the U . S . And lead to curbs on American imports of their products . But some exporters said that while the conflict would hurt them in the long - run , in the short - term Tokyo ' s loss might be their gain . The U . S . Has said it will impose 300 mln dlrs of tariffs on imports of Japanese electronics goods on April 17 , in retaliation for Japan ' s alleged failure to stick to a pact not to sell semiconductors on world markets at below cost . 
Unofficial Japanese estimates put the impact of the tariffs at 10 billion dlrs and spokesmen for major electronics firms said they would virtually halt exports of products hit by the new taxes . "" We wouldn ' t be able to do business ,"" said a spokesman for leading Japanese electronics firm Matsushita Electric Industrial Co Ltd & lt ; MC . T >. "" If the tariffs remain in place for any length of time beyond a few months it will mean the complete erosion of exports ( of goods subject to tariffs ) to the U . S .,"" said Tom Murtha , a stock analyst at the Tokyo office of broker & lt ; James Capel and Co >. In Taiwan , businessmen and officials are also worried . "" We are aware of the seriousness of the U . S . Threat against Japan because it serves as a warning to us ,"" said a senior Taiwanese trade official who asked not to be named . Taiwan had a trade trade surplus of 15 . 6 billion dlrs last year , 95 pct of it with the U . S . The surplus helped swell Taiwan ' s foreign exchange reserves to 53 billion dlrs , among the world ' s largest . "" We must quickly open our markets , remove trade barriers and cut import tariffs to allow imports of U . S . Products , if we want to defuse problems from possible U . S . Retaliation ,"" said Paul Sheen , chairman of textile exporters & lt ; Taiwan Safe Group >. A senior official of South Korea ' s trade promotion association said the trade dispute between the U . S . And Japan might also lead to pressure on South Korea , whose chief exports are similar to those of Japan . Last year South Korea had a trade surplus of 7 . 1 billion dlrs with the U . S ., Up from 4 . 9 billion dlrs in 1985 . In Malaysia , trade officers and businessmen said tough curbs against Japan might allow hard - hit producers of semiconductors in third countries to expand their sales to the U . S . In Hong Kong , where newspapers have alleged Japan has been selling below - cost semiconductors , some electronics manufacturers share that view . 
But other businessmen said such a short - term commercial advantage would be outweighed by further U . S . Pressure to block imports . "" That is a very short - term view ,"" said Lawrence Mills , director - general of the Federation of Hong Kong Industry . "" If the whole purpose is to prevent imports , one day it will be extended to other sources . Much more serious for Hong Kong is the disadvantage of action restraining trade ,"" he said . The U . S . Last year was Hong Kong ' s biggest export market , accounting for over 30 pct of domestically produced exports . The Australian government is awaiting the outcome of trade talks between the U . S . And Japan with interest and concern , Industry Minister John Button said in Canberra last Friday . "" This kind of deterioration in trade relations between two countries which are major trading partners of ours is a very serious matter ,"" Button said . He said Australia ' s concerns centred on coal and beef , Australia ' s two largest exports to Japan and also significant U . S . Exports to that country . Meanwhile U . S .- Japanese diplomatic manoeuvres to solve the trade stand - off continue . Japan ' s ruling Liberal Democratic Party yesterday outlined a package of economic measures to boost the Japanese economy . The measures proposed include a large supplementary budget and record public works spending in the first half of the financial year . They also call for stepped - up spending as an emergency measure to stimulate the economy despite Prime Minister Yasuhiro Nakasone ' s avowed fiscal reform program . Deputy U . S . Trade Representative Michael Smith and Makoto Kuroda , Japan ' s deputy minister of International Trade and Industry ( MITI ), are due to meet in Washington this week in an effort to end the dispute ."
1,china daily say vermin eat pct grain stock survey province seven city show vermin consume seven pct china grain stock china daily say also say year mln tonne pct china fruit output leave rot mln tonne pct vegetable paper blame waste inadequate storage bad preservation method say government launch national programme reduce waste call improve technology storage preservation great production additives paper give detail,"CHINA DAILY SAYS VERMIN EAT 7 - 12 PCT GRAIN STOCKS A survey of 19 provinces and seven cities showed vermin consume between seven and 12 pct of China ' s grain stocks , the China Daily said . It also said that each year 1 . 575 mln tonnes , or 25 pct , of China ' s fruit output are left to rot , and 2 . 1 mln tonnes , or up to 30 pct , of its vegetables . The paper blamed the waste on inadequate storage and bad preservation methods . It said the government had launched a national programme to reduce waste , calling for improved technology in storage and preservation , and greater production of additives . The paper gave no further details ."


In [26]:
# Show the first 10 rows of the preprocessed data set and the original data set
df.loc[:9, ['sentence', 'original_sentence']]

Unnamed: 0,sentence,original_sentence
0,asian exporter fear damage u japan rift mount ...,ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPA...
1,china daily say vermin eat pct grain stock sur...,CHINA DAILY SAYS VERMIN EAT 7 - 12 PCT GRAIN S...
2,japan revise long term energy demand downwards...,JAPAN TO REVISE LONG - TERM ENERGY DEMAND DOWN...
3,thai trade deficit widen first quarter thailan...,THAI TRADE DEFICIT WIDENS IN FIRST QUARTER Tha...
4,indonesia see cpo price rise sharply indonesia...,INDONESIA SEES CPO PRICE RISING SHARPLY Indone...
5,australian foreign ship ban end nsw port hit t...,AUSTRALIAN FOREIGN SHIP BAN ENDS BUT NSW PORTS...
6,indonesian commodity exchange may expand indon...,INDONESIAN COMMODITY EXCHANGE MAY EXPAND The I...
7,sri lanka get usda approval wheat price food d...,SRI LANKA GETS USDA APPROVAL FOR WHEAT PRICE F...
8,western mining open new gold mine australia we...,WESTERN MINING TO OPEN NEW GOLD MINE IN AUSTRA...
9,sumitomo bank aim quick recovery merger sumito...,SUMITOMO BANK AIMS AT QUICK RECOVERY FROM MERG...


### Question 7. When comparing the preprocessed and the original sentences, how do you think the preprocessing will affect the word2vec model?

In [27]:
# Split each sentence into words and store them in the "processed_sentence" column,
# giving the list of lists that the word2vec algorithm expects
df['processed_sentence'] = df['sentence'].apply(lambda x: x.split())

In [28]:
# Displaying the 3rd row after converting to words
print(df.iloc[2]['processed_sentence'])

['japan', 'revise', 'long', 'term', 'energy', 'demand', 'downwards', 'ministry', 'international', 'trade', 'industry', 'miti', 'revise', 'long', 'term', 'energy', 'supply', 'demand', 'outlook', 'august', 'meet', 'forecast', 'downtrend', 'japanese', 'energy', 'demand', 'ministry', 'official', 'say', 'miti', 'expect', 'low', 'projection', 'primary', 'energy', 'supply', 'year', 'mln', 'kilolitres', 'kl', 'mln', 'say', 'decision', 'follow', 'emergence', 'structural', 'change', 'japanese', 'industry', 'follow', 'rise', 'value', 'yen', 'decline', 'domestic', 'electric', 'power', 'demand', 'miti', 'planning', 'work', 'revise', 'energy', 'supply', 'demand', 'outlook', 'deliberation', 'committee', 'meeting', 'agency', 'natural', 'resource', 'energy', 'official', 'say', 'say', 'miti', 'also', 'review', 'breakdown', 'energy', 'supply', 'source', 'include', 'oil', 'nuclear', 'coal', 'natural', 'gas', 'nuclear', 'energy', 'provide', 'bulk', 'japan', 'electric', 'power', 'fiscal', 'year', 'end', 'ma

In [29]:
# get clean data (noise reduction, lower case, stop word removal and lemmatization)
cleaned_sentences = df['processed_sentence']

In [30]:
cleaned_sentences

0        [asian, exporter, fear, damage, u, japan, rift...
1        [china, daily, say, vermin, eat, pct, grain, s...
2        [japan, revise, long, term, energy, demand, do...
3        [thai, trade, deficit, widen, first, quarter, ...
4        [indonesia, see, cpo, price, rise, sharply, in...
                               ...                        
10783    [u, k, money, market, shortage, forecast, revi...
10784    [knight, ridder, inc, lt, krn, set, quarterly,...
10785    [technitrol, inc, lt, tnl, set, quarterly, qtl...
10786    [nationwide, cellular, service, inc, lt, ncel,...
10787    [lt, h, automotive, technology, corp, year, ne...
Name: processed_sentence, Length: 10788, dtype: object

#### Question 8. What about stop word filtering or lemmatisation: how do they affect the similarity scores? Are the scores better or worse? Please discuss.

In [31]:
cleaned_model = Word2Vec(cleaned_sentences, vector_size=100, window=5, min_count=1, sg=0)
# Find vector for a word
vector = cleaned_model.wv['economy']

# Print the vector for the word
print(vector)


[-2.6237828e-01  1.0334674e+00 -1.8104994e+00  6.8121731e-01
  1.8120116e-01  2.8664717e-02  6.9259387e-01  1.8393155e+00
  1.4036654e+00  8.9888114e-01 -5.0176746e-01 -1.1607921e+00
 -1.1626451e+00 -6.7188567e-03 -2.8223148e-01  3.7768204e-02
  5.1969022e-01 -1.2591678e+00 -3.8782351e-02 -7.9205155e-02
  1.3112509e+00 -5.3181738e-01 -8.1847352e-01  1.5476577e+00
 -1.7563993e+00 -7.7186489e-01 -5.0444824e-01  3.7775013e-01
 -8.6114496e-01  7.3197114e-01  1.3522334e+00 -1.4682155e+00
  1.6402821e-01  1.0172895e+00 -1.5423242e+00 -7.0448697e-01
  1.8420955e+00 -1.9724704e+00  8.6732101e-01 -7.9823327e-01
  1.3440945e+00  6.5371245e-01 -6.4129198e-01 -5.1420373e-01
 -2.7793720e-01 -2.2636591e-01 -1.3119627e+00 -6.6612996e-02
  1.7363222e+00  3.3171928e-01  1.3110876e+00 -1.7070427e+00
 -9.0783173e-01 -1.2397906e-01  7.4256390e-01 -2.3367156e-01
  1.0906454e+00 -4.1981736e-01 -1.9774607e+00 -2.9199052e-01
 -1.5975155e+00  1.7213882e+00  6.6298312e-01 -2.3358831e+00
 -1.2878349e+00 -1.42884

In [32]:
# Find words most similar to a given word
# remember that all words are now in lower case.
# europe, oil, export, economy, business, agricultural, russia, china, president, queen, unemployment, computer
similar_words = cleaned_model.wv.most_similar('economy')

print(similar_words)

[('recession', 0.8983740210533142), ('economic', 0.897454559803009), ('slow', 0.8885834813117981), ('prospect', 0.8729284405708313), ('stimulate', 0.8723751902580261), ('inflationary', 0.8711614012718201), ('pressure', 0.8654695749282837), ('turn', 0.8647941946983337), ('sustain', 0.8580414652824402), ('external', 0.8543429970741272)]
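
To compare the two models more systematically, one option is to look at which neighbours of 'economy' they share. The sketch below simply copies the two neighbour lists from the outputs printed above, so it runs without the trained models. Your own lists will probably differ somewhat, since Word2Vec training is not fully deterministic.

```python
# Neighbour lists copied from the two most_similar('economy') outputs above.
raw_neighbours = ['economic', 'strength', 'situation', 'policies', 'policy',
                  'monetary', 'political', 'aggregates', 'country', 'growing']
cleaned_neighbours = ['recession', 'economic', 'slow', 'prospect', 'stimulate',
                      'inflationary', 'pressure', 'turn', 'sustain', 'external']

shared = set(raw_neighbours) & set(cleaned_neighbours)
only_raw = [w for w in raw_neighbours if w not in shared]
only_cleaned = [w for w in cleaned_neighbours if w not in shared]

print("shared:", sorted(shared))
print("only in raw model:", only_raw)
print("only in cleaned model:", only_cleaned)
```

With the trained models at hand, the same comparison can be made by taking the first element of each tuple returned by most_similar.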


Exercise 6: Visualizing Word Vectors.

Here we create visualizations of word vectors. We install some additional libraries like pyplot / plotly.

In [33]:
from sklearn.decomposition import PCA
from matplotlib import pyplot

# All unique words in the Reuters corpus; plotting them all gives a word cloud that is difficult to investigate.
#unique_words = list(set(word for sentence in cleaned_sentences for word in sentence))

# A few words that are easy to display and investigate
unique_words = ['president', 'chairman','dollar', 'money','rate','oil', 'trade','fuel', 'stock','recovery']

# Keep only the words that are in the model's vocabulary, so that the plot labels
# stay aligned with the vectors
unique_words = [word for word in unique_words if word in cleaned_model.wv]

# Retrieve the word vectors for the selected words
# Model trained on the unprocessed corpus:
# X = [model.wv[word] for word in unique_words if word in model.wv]
# Model trained on the preprocessed corpus, using 100 of all unique words:
# X = [cleaned_model.wv[word] for word in unique_words[100:200] if word in cleaned_model.wv]
# Model trained on the preprocessed corpus:
X = [cleaned_model.wv[word] for word in unique_words]

print(len(unique_words))
# Fit a 2d PCA model to the vectors
pca = PCA(n_components=2)
result = pca.fit_transform(X)

10


In [34]:
#pip install plotly
import plotly.graph_objs as go

# Extract coordinates
x_coords = result[:, 0]
y_coords = result[:, 1]

# Create a scatter plot
scatter = go.Scatter(x=x_coords, y=y_coords, mode='markers+text', text=unique_words, textposition='top center')

# Define layout
layout = go.Layout(
    title='Unique Word Embeddings',
    xaxis=dict(title='PCA Dimension 1'),
    yaxis=dict(title='PCA Dimension 2')
)

# Define figure and plot
fig = go.Figure(data=[scatter], layout=layout)
fig.show()

#### Question 9. Try commenting out the stop word filtering or the lemmatisation and build a new word2vec model. How are the results now?

An easy alternative is to use the old model trained on the unprocessed corpus:
X = [model.wv[word] for word in unique_words if word in model.wv]
and then visualise it as above, or add some code below and run it again.


In [42]:
from sklearn.decomposition import PCA
from matplotlib import pyplot

# All unique words in the Reuters corpus; plotting them all gives a word cloud that is difficult to investigate.
#unique_words = list(set(word for sentence in cleaned_sentences for word in sentence))

# A few words that are easy to display and investigate
# unique_words = ['president', 'chairman','dollar', 'money','rate','oil', 'trade','fuel', 'stock','recovery']
unique_words = ['industry', 'product','cost', 'domestic','conflict','billion', 'chairman','minister', 'money','stimulate']

# Keep only the words that are in the model's vocabulary, so that the plot labels
# stay aligned with the vectors
unique_words = [word for word in unique_words if word in cleaned_model.wv]

# Retrieve the word vectors for the selected words
# from the model trained on the preprocessed corpus
X = [cleaned_model.wv[word] for word in unique_words]


print(len(unique_words))
# Fit a 2d PCA model to the vectors
pca = PCA(n_components=2)
result = pca.fit_transform(X)

# Extract coordinates
x_coords = result[:, 0]
y_coords = result[:, 1]

# Create a scatter plot
scatter = go.Scatter(x=x_coords, y=y_coords, mode='markers+text', text=unique_words, textposition='top center')

# Define layout
layout = go.Layout(
    title='Unique Word Embeddings',
    xaxis=dict(title='PCA Dimension 1'),
    yaxis=dict(title='PCA Dimension 2')
)

# Define figure and plot
fig = go.Figure(data=[scatter], layout=layout)
fig.show()

10


### Question 10. What do you think about the visual representation? Please play around with the words and check for relationships.

Conclusion:
In this lab, you've learned the basics of Word2Vec, including training a Word2Vec model, exploring word similarity, preprocessing the input text, and visualising word vectors. Word2Vec is a powerful tool for natural language processing tasks and has various applications in text analysis.
